## Learn with us: www.zerotodeeplearning.com

Copyright © 2021: Zero to Deep Learning ® Catalit LLC.

In [None]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Bag of Words Features for Text

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
url = "https://raw.githubusercontent.com/zerotodeeplearning/ztdl-masterclasses/master/data/"

In [None]:
df = pd.read_csv(url + 'wikipedia_languages.csv')
df.head()

In [None]:
classes = df['language'].unique()
classes

In [None]:
for language in classes:
  print(df[df['language'] == language].head())
  print()

In [None]:
df['language'].value_counts()

In [None]:
df.info()

### Classification based on alphabet

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
df_train, df_test = train_test_split(df, test_size=0.15, random_state=0)
docs_train = df_train['sentence']
docs_test = df_test['sentence']
y_train = df_train['language']
y_test = df_test['language']

In [None]:
# Concatenate all training sentences of each language into one long string
all_text = df_train.groupby('language')['sentence'].agg(''.join)

In [None]:
all_text

In [None]:
# Collect the 20 most frequent characters of each language
world_alphabets = []
for language in classes:
  list_of_chars = list(all_text.loc[language])
  top_chars_counts = pd.Series(list_of_chars).value_counts().head(20)
  top_chars_list = list(top_chars_counts.index)
  world_alphabets.extend(top_chars_list)

In [None]:
unique_letters = np.unique(world_alphabets)
len(unique_letters)

In [None]:
# Count single characters, restricted to the fixed vocabulary built above
cnt_vect = CountVectorizer(analyzer='char',
                           vocabulary=unique_letters)

In [None]:
model = make_pipeline(cnt_vect,
                      LogisticRegression(solver='liblinear'))

In [None]:
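def display_language(language):
  # Show the character counts of up to 150 sentences as an image
  samples = df.loc[df['language'] == language, 'sentence'].iloc[:150]
  # transform works without fit because the vocabulary is fixed
  features = cnt_vect.transform(samples)
  # toarray returns a plain ndarray (todense returns np.matrix)
  plt.imshow(features.toarray())
  plt.title(language)
  plt.axis('off')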

In [None]:
plt.figure(figsize=(10, 7))
for i, language in enumerate(classes):
  plt.subplot(4, 5, i+1)
  display_language(language)

plt.tight_layout()

In [None]:
model.fit(docs_train, y_train)

In [None]:
model.score(docs_train, y_train)

In [None]:
model.score(docs_test, y_test)

### Exercise 1: TF-IDF Vectorizer

The classification based on the alphabet worked, but the results were not great. Can we improve them using TF-IDF?

- Build a new model that uses the `TfidfVectorizer` to vectorize the text.
- Configure the `TfidfVectorizer` to analyze the text by characters, using character n-grams of 1 to 3 characters. You may also limit the maximum number of features.
- Use a pipeline with an estimator of your choice, then train the model and evaluate it on the training and test sets. What's the highest score you can get?

Your code will look like:
```python
tfidf_vect = TfidfVectorizer(# YOUR CODE HERE
)

model = make_pipeline(# YOUR CODE HERE
)

# YOUR CODE HERE
```
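
To see what these settings do before trying them on the Wikipedia data, here is a minimal sketch on a tiny made-up corpus. The names `toy_docs`, `toy_labels`, and `toy_model`, as well as the value `max_features=200`, are assumptions chosen for illustration; they are not part of the notebook and not a reference solution.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, NOT the Wikipedia data set used in this notebook
toy_docs = [
    "the cat sat on the mat",
    "the dog ate the bone",
    "this is the house of the king",
    "le chat est sur le tapis",
    "le chien mange un os",
    "voici la maison du roi",
]
toy_labels = ["en", "en", "en", "fr", "fr", "fr"]

# Character n-grams of length 1 to 3, capped at 200 features (arbitrary cap)
toy_model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 3), max_features=200),
    LogisticRegression(solver="liblinear"),
)
toy_model.fit(toy_docs, toy_labels)

# make_pipeline names each step after its lowercased class name
vect = toy_model.named_steps["tfidfvectorizer"]
features = vect.transform(toy_docs)

print(features.shape)              # (number of documents, number of n-grams kept)
print(sorted(vect.vocabulary_)[:10])  # a few of the learned n-grams
train_score = toy_model.score(toy_docs, toy_labels)
print(train_score)                 # training accuracy only, on toy data
```

Each column of `features` is one character n-gram, so raising the upper bound of `ngram_range` or `max_features` trades a larger feature matrix for more context per feature.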

### Exercise 2: Investigation of results

Let's dig deeper into the results we got.

- Use the model to predict the labels on `docs_test`
- Use a `classification_report` to inspect the precision and recall of each language. Which languages work and which do not?
- Dig deeper into the results by displaying a `confusion_matrix`. Which languages get mixed up?
- Bonus points if you can display the confusion matrix nicely with pandas.
- Inspect some of the confused items. Use numpy to select the rows in `docs_test` for which two languages are confused. Can you see what the problem is? Are the labels accurate?
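
The mechanics of these steps can be sketched on a handful of made-up labels. The arrays `toy_docs`, `toy_true`, and `toy_pred` below are hypothetical stand-ins for `docs_test`, `y_test`, and the model's predictions; the sketch only shows the calls involved, not results from the real data.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical documents, true labels, and predictions (illustration only)
toy_docs = np.array(["doc a", "doc b", "doc c", "doc d", "doc e", "doc f"])
toy_true = np.array(["en", "en", "fr", "fr", "de", "de"])
toy_pred = np.array(["en", "fr", "fr", "fr", "de", "en"])

# Fix the label order so that report and matrix rows line up
labels = np.unique(toy_true)

print(classification_report(toy_true, toy_pred, labels=labels, zero_division=0))

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(toy_true, toy_pred, labels=labels)
cm_df = pd.DataFrame(cm, index=labels, columns=labels)
cm_df.index.name = "true"
cm_df.columns.name = "predicted"
print(cm_df)

# Boolean mask: documents that are really "en" but were predicted as "fr"
confused = (toy_true == "en") & (toy_pred == "fr")
print(toy_docs[confused])
```

The same boolean-mask idea should carry over to a pandas Series such as `docs_test`, as long as the mask is built from labels and predictions in the same row order.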