# Dealing with Text Data
  
Finally, in this chapter, you will work with unstructured text data and learn ways to engineer columnar features out of a text corpus. You will compare how different approaches affect how much context is extracted from a text, and how to balance the need for context against creating too many features.

In [45]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Introduction to Text Encoding
  
So far in this course you have dealt with data that, while sometimes messy, has been generally columnar in nature. When you are faced with text data this is often not going to be the case.
  
**Standardizing your text**
  
Data that is not in a predefined form is called unstructured data, and free text data is a good example of this. Before you can leverage text data in a machine learning model you must first transform it into a series of columns of numbers or vectors. There are many different approaches to doing this, and this chapter goes through the most common ones. You will be working with the United States inaugural address dataset, which contains the text of each President's inaugural speech; George Washington's is shown here. It is clear that free text like this is not in tabular form.
  
<img src='../_images/text-encoding-free-text-example.png' text='alt text' width='500'>
  
**Dataset**
  
Before any text analytics can be performed, you must ensure that the text data is in a format that can be used. The speeches have been loaded as a pandas DataFrame called `speech_df`, with the body of the text in the 'text' column as can be seen by looking at the top five rows using the `.head()` method as shown.
  
<img src='../_images/text-encoding-free-text-example1.png' text='alt text' width='500'>
  
**Removing unwanted characters**
  
Most bodies of text contain non-letter characters, such as punctuation, that need to be removed before analysis. This can be achieved with the `.replace()` method along with the `str` accessor. You used this in an earlier chapter, but instead of specifying the exact characters to replace, this time you will use patterns called regular expressions. Unless you read through the text of every speech, it is difficult to determine which non-letter characters are present in the data, so the easiest approach is to specify a pattern that matches all non-letter characters, as shown here. Inside square brackets, the ranges lowercase a to lowercase z and uppercase A to uppercase Z (`[a-zA-Z]`) match any letter. Placing a caret at the start of the brackets (`[^a-zA-Z]`) negates the set, so it matches any non-letter character. You then use `.str.replace()` with this pattern to replace every non-letter character with a whitespace, as shown here. Note that this pattern also replaces digits and accented letters.
  
<img src='../_images/text-encoding-free-text-example2.png' text='alt text' width='500'>
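
The pattern described above can be tried on a couple of short strings. This is a minimal sketch: the `sample` Series is a hypothetical stand-in for the `text` column, not the real dataset.

```python
import pandas as pd

# Hypothetical two-row sample standing in for speech_df['text']
sample = pd.Series(["Fellow-Citizens: I AM here!", "CALLED upon, in 1801."])

# [^a-zA-Z] matches any single character that is NOT an ASCII letter;
# regex=True tells pandas to treat the pattern as a regular expression
cleaned = sample.str.replace('[^a-zA-Z]', ' ', regex=True)

print(cleaned.tolist())
```

Because each non-letter character is swapped for exactly one space, the cleaned strings keep their original length and may contain runs of consecutive spaces.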
  
**Removing unwanted characters**
  
Here you can see the text of the first speech before and after processing. Notice that the hyphen and the colon are missing.
  
<img src='../_images/text-encoding-free-text-example3.png' text='alt text' width='500'>
  
**Standardize the case**
  
Once all unwanted characters have been removed you will want to standardize the remaining characters in your text so that they are all lower case. This will ensure that the same word with and without capitalization will not be counted as separate words. You can use the `.lower()` method to achieve this as shown here.
  
<img src='../_images/text-encoding-free-text-example4.png' text='alt text' width='500'>
  
**Length of text**
  
Later in this chapter you will work through the creation of features based on the content of different texts, but often there is value in the fundamental characteristics of a passage, such as its length. Using the `.len()` method of the `str` accessor (`.str.len()`), you can calculate the number of characters in each speech.
  
<img src='../_images/text-encoding-free-text-example5.png' text='alt text' width='500'>
  
**Word counts**
  
Along with the pure character length of the speech, you may want to know how many words it contains. The most straightforward way to do this is to split the speech on any whitespace and then count how many words there are after the split. First, you split the text with the `.split()` method as shown here, and
  
<img src='../_images/text-encoding-free-text-example6.png' text='alt text' width='500'>
  
**Word counts**
  
then chain `.str.len()` to count the total number of words in each speech.
  
<img src='../_images/text-encoding-free-text-example7.png' text='alt text' width='500'>
  
**Average length of word**
  
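Finally, one other statistic you can calculate is the average word length. Since you already have the total number of characters and the word count, you can simply divide the former by the latter. Note that the character count includes whitespace, so this ratio slightly overstates the true average word length.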
  
<img src='../_images/text-encoding-free-text-example8.png' text='alt text' width='500'>
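
The three size features can be sketched on a single short string. The `text_clean` Series below is a hypothetical stand-in for the cleaned column, and `letter_cnt` is an optional extra (not part of the course code) that counts letters only.

```python
import pandas as pd

# Hypothetical cleaned text standing in for speech_df['text_clean']
text_clean = pd.Series(["fellow citizens of the senate"])

char_cnt = text_clean.str.len()                # characters, including spaces
word_cnt = text_clean.str.split().str.len()    # split on whitespace, then count the pieces
avg_word_length = char_cnt / word_cnt          # numerator still includes the spaces

# Counting only letters gives the average length of the words themselves
letter_cnt = text_clean.str.count('[a-z]')
avg_letters_per_word = letter_cnt / word_cnt

print(char_cnt[0], word_cnt[0], avg_word_length[0], avg_letters_per_word[0])
```

The gap between the two averages comes entirely from the whitespace included in the character count.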

### Cleaning up your text
  
Unstructured text data cannot be directly used in most analyses. Multiple steps need to be taken to go from a long free form string to a set of numeric columns in the right format that can be ingested by a machine learning model. The first step of this process is to standardize the data and eliminate any characters that could cause problems later on in your analytic pipeline.

In this chapter you will be working with a new dataset containing the inaugural speeches of the presidents of the United States loaded as `speech_df`, with the speeches stored in the text column.
  
1. Print the first 5 rows of the `text` column to see the free text fields.
2. Replace all non letter characters in the `text` column with a whitespace.
3. Make all characters in the newly created `text_clean` column lower case.

In [46]:
# Loading the dataframe
speech_df = pd.read_csv('../_datasets/inaugural_speeches.csv')
speech_df.head()

Unnamed: 0,Name,Inaugural Address,Date,text
0,George Washington,First Inaugural Address,"Thursday, April 30, 1789",Fellow-Citizens of the Senate and of the House...
1,George Washington,Second Inaugural Address,"Monday, March 4, 1793",Fellow Citizens: I AM again called upon by th...
2,John Adams,Inaugural Address,"Saturday, March 4, 1797","WHEN it was first perceived, in early times, t..."
3,Thomas Jefferson,First Inaugural Address,"Wednesday, March 4, 1801",Friends and Fellow-Citizens: CALLED upon to u...
4,Thomas Jefferson,Second Inaugural Address,"Monday, March 4, 1805","PROCEEDING, fellow-citizens, to that qualifica..."


In [47]:
# Print the first 5 rows of the text column
speech_df['text'].head()

0    Fellow-Citizens of the Senate and of the House...
1    Fellow Citizens:  I AM again called upon by th...
2    WHEN it was first perceived, in early times, t...
3    Friends and Fellow-Citizens:  CALLED upon to u...
4    PROCEEDING, fellow-citizens, to that qualifica...
Name: text, dtype: object

In [48]:
# Replace all non letter characters with a whitespace
speech_df['text_clean'] = speech_df['text'].str.replace('[^a-zA-Z]', ' ', regex=True)

# Change to lower case
speech_df['text_clean'] = speech_df['text_clean'].str.lower()

# Print the first 5 rows of text_clean column
print(speech_df['text_clean'].head())

0    fellow citizens of the senate and of the house...
1    fellow citizens   i am again called upon by th...
2    when it was first perceived  in early times  t...
3    friends and fellow citizens   called upon to u...
4    proceeding  fellow citizens  to that qualifica...
Name: text_clean, dtype: object


Great, now your text strings have been standardized and cleaned up. You can now use this new column (`text_clean`) to extract information about the speeches.

### High level text features
  
Once the text has been cleaned and standardized you can begin creating features from the data. The most fundamental information you can calculate about free form text is its size, such as its length and number of words. In this exercise (and the rest of this chapter), you will focus on the cleaned/transformed text column (`text_clean`) you created in the last exercise.
  
1. Record the character length of each speech in the `char_cnt` column.
2. Record the word count of each speech in the `word_cnt` column.
3. Record the average word length of each speech in the `avg_word_length` column.

In [49]:
# Find the length of each text
speech_df['char_cnt'] = speech_df['text_clean'].str.len()

# Count the number of words in each text
speech_df['word_cnt'] = speech_df['text_clean'].str.split().str.len()

# Find the average length of word
speech_df['avg_word_length'] = speech_df['char_cnt'] / speech_df['word_cnt']

# Print the first 5 rows of these columns
speech_df[['text_clean', 'char_cnt', 'word_cnt', 'avg_word_length']].head()

Unnamed: 0,text_clean,char_cnt,word_cnt,avg_word_length
0,fellow citizens of the senate and of the house...,8616,1432,6.01676
1,fellow citizens i am again called upon by th...,787,135,5.82963
2,when it was first perceived in early times t...,13871,2323,5.971158
3,friends and fellow citizens called upon to u...,10144,1736,5.843318
4,proceeding fellow citizens to that qualifica...,12902,2169,5.948363


These features may appear basic but can be quite useful in ML models.

## Word Count Representation
  
Once high level information has been recorded you can begin creating features based on the actual content of each text.
  
**Text to columns**
  
The most common approach to this is to create a column for each word and record the number of times each particular word appears in each text. This results in a set of columns equal in width to the number of unique words in the dataset, with counts filling each entry. Taking just one sentence we can see that "of" occurs 3 times, "the" 2 times and the other words once.
  
<img src='../_images/word-count-representation-ml-with-text.png' text='alt text' width='500'>
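
Counting by hand for a single sentence might look like the sketch below. The sentence is an assumption, chosen so that its counts match the ones described above.

```python
from collections import Counter

# Illustrative sentence (an assumption, chosen to match the counts described above)
sentence = "fellow citizens of the senate and of the house of representatives"

# Split on whitespace and tally each distinct word
counts = Counter(sentence.split())

for word, n in counts.most_common():
    print(f"{word}: {n}")
```

Each distinct word would become one column, and the tallies would fill the row for this text.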
  
**Initializing the vectorizer**
  
While you could of course write a script to do this counting yourself, scikit-learn already has this functionality built in with its `CountVectorizer` class.
  
`from sklearn.feature_extraction.text import CountVectorizer`  
  
Then instantiate it by assigning it to a variable name, `cv` in this case.  
  
<img src='../_images/word-count-representation-ml-with-text1.png' text='alt text' width='500'>
  
**Specifying the vectorizer**
  
It may have become apparent that creating a column for every word will result in far too many features for analysis. Thankfully, you can specify arguments when initializing your `CountVectorizer` to limit this. For example, you can specify the minimum number of texts that a word must appear in using the argument `min_df`. If an integer is given, it is an absolute number of documents; if a float is given, the word must appear in at least this proportion of documents. This threshold eliminates words that occur so rarely that they would not be useful when generalizing to new texts. Conversely, `max_df` keeps only words that occur in at most a certain proportion of the documents. This can be useful to remove words that occur too frequently to be of any value.
  
<img src='../_images/word-count-representation-ml-with-text2.png' text='alt text' width='500'>
  
**Fit the vectorizer**
  
Once the vectorizer has been instantiated you can then fit it on the data you want to create your features around. This is done by calling the `.fit()` method on the relevant column.
  
`cv.fit(speech_df['text_clean'])`  
  
**Transforming your text**
  
Once the vectorizer has been fit you can call the `.transform()` method on the column you want to transform. This outputs a sparse array, with a row for every text and a column for every word that has been counted.
  
<img src='../_images/word-count-representation-ml-with-text3.png' text='alt text' width='500'>
  
**Transforming your text**
  
To transform this to a non sparse array you can use the `.toarray()` method.
  
`cv_transformed.toarray()`  
  
**Getting the features**
  
You may notice that the output is an array, with no concept of column names. To get the names of the features that have been generated you can call the `.get_feature_names_out()` method on the vectorizer (older scikit-learn releases used `.get_feature_names()`, which has since been removed). It returns an array of the features generated, in the same order as the columns of the converted array.
  
<img src='../_images/word-count-representation-ml-with-text4.png' text='alt text' width='500'>
  
**Fitting and transforming**
  
As an aside, while fitting and transforming separately can be useful, particularly when you need to transform a different dataset than the one that you fit the vectorizer to, you can accomplish both steps at once using the `.fit_transform()` method.
  
<img src='../_images/word-count-representation-ml-with-text5.png' text='alt text' width='500'>
  
**Putting it all together**
  
Now that you have an array containing the count values of each of the words of interest, and a way to get the feature names you can combine these in a DataFrame as shown here. The `.add_prefix()` method allows you to be able to distinguish these columns in the future.
  
<img src='../_images/word-count-representation-ml-with-text6.png' text='alt text' width='500'>
  
|   | Counts_aback | Counts_abandon | Counts_abandonment |
|---|---|---|---|
| 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 1 |
| 2 | 0 | 1 | 0 |
| 3 | 0 | 1 | 0 |
| 4 | 0 | 0 | 0 |
  
**Updating your DataFrame**
  
You can now combine this DataFrame with your original DataFrame using pandas' `pd.concat()` function, so that the new features can be used in future analytical models. Remember to set the `axis` argument to 1, as you want to bind these DataFrames column-wise. Checking the DataFrame's shape shows the new, much wider size.
  
<img src='../_images/word-count-representation-ml-with-text7.png' text='alt text' width='500'>

### Counting words (I)
  
Once high level information has been recorded you can begin creating features based on the actual content of each text. One way to do this is to approach it in a similar way to how you worked with categorical variables in the earlier lessons.
  
For each unique word in the dataset a column is created.
For each entry, the number of times this word occurs is counted and the count value is entered into the respective column.
  
These "count" columns can then be used to train machine learning models.
  
1. `from sklearn.feature_extraction.text import CountVectorizer`
2. Instantiate `CountVectorizer` and assign it to `cv`.
3. Fit the vectorizer to the `text_clean` column.
4. Print the feature names generated by the vectorizer.

In [50]:
from sklearn.feature_extraction.text import CountVectorizer


# Instantiate CountVectorizer
cv = CountVectorizer()

# Fit the vectorizer
cv.fit(speech_df['text_clean'])

# Print feature names
print(cv.get_feature_names_out()[:10])  # Extracting the first 10 feature names


['abandon' 'abandoned' 'abandonment' 'abate' 'abdicated' 'abeyance'
 'abhorring' 'abide' 'abiding' 'abilities']


Great, this vectorizer can be applied to both the text it was trained on, and new texts.

### Counting words (II)
  
Once the vectorizer has been fit to the data, it can be used to transform the text to an array representing the word counts. This array will have a row per block of text and a column for each of the features generated by the vectorizer that you observed in the last exercise.
  
The vectorizer you fit in the last exercise (`cv`) is available in your workspace.
  
1. Apply the vectorizer to the `text_clean` column.
2. Convert this transformed (sparse) array into a `numpy` array with counts.
3. Print the dimensions of this `numpy` array.

In [51]:
# Apply the vectorizer
cv_transformed = cv.transform(speech_df['text_clean'])

# Convert the sparse matrix to a dense NumPy array and print it
cv_array = cv_transformed.toarray()
print(cv_array)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 1 0 ... 0 0 0]
 ...
 [0 1 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [52]:
# Print the shape of cv_array
print(cv_array.shape)

(58, 9043)


The speeches have 9043 unique words, which is a lot! In the next exercise, you will see how to create a limited set of features.

### Limiting your features
  
As you have seen, using the `CountVectorizer` with its default settings creates a feature for every single word in your corpus. This can create far too many features, often including ones that will provide very little analytical value.
  
For this purpose `CountVectorizer` has parameters that you can set to reduce the number of features:
  
- `min_df=` : Use only words that occur in at least this proportion of documents (or, if an integer is given, in at least this many documents). This can be used to remove outlier words that will not generalize across texts.
  
- `max_df=` : Use only words that occur in at most this proportion of documents. This is useful to eliminate very common words that occur in nearly every text without adding value, such as "and" or "the".
  
1. Limit the number of features in the `CountVectorizer` by requiring a word to appear in at least 20% and at most 80% of the documents.
2. Fit and apply the vectorizer on `text_clean` column in one step.
3. Convert this transformed (sparse) array into a `numpy` array with counts.
4. Print the dimensions of the new reduced array.

In [53]:
from sklearn.feature_extraction.text import CountVectorizer


# Specify arguments to limit the number of features generated
cv = CountVectorizer(min_df=0.2, max_df=0.8)  # Keep words that appear in at least 20% and at most 80% of the documents

# Fit, transform, and convert into array
cv_transformed = cv.fit_transform(speech_df['text_clean'])
cv_array = cv_transformed.toarray()

# Print the array shape
print(cv_array.shape)
print(cv_array)

(58, 818)
[[ 0  0  0 ...  5  0  9]
 [ 0  0  0 ...  0  0  1]
 [ 0  0  0 ...  0  0  1]
 ...
 [ 0  1  0 ... 14  1  3]
 [ 0  0  0 ...  5  1  0]
 [ 0  0  0 ... 14  1 11]]


Did you notice that the number of features (unique words) greatly reduced from 9043 to 818?

### Text to DataFrame
  
Now that you have generated these count based features in an array you will need to reformat them so that they can be combined with the rest of the dataset. This can be achieved by converting the array into a `pandas` DataFrame, with the feature names you found earlier as the column names, and then concatenating it with the original DataFrame.
  
The numpy array (`cv_array`) and the vectorizer (`cv`) you fit in the last exercise are available in your workspace.
  
1. Create a DataFrame `cv_df` containing the `cv_array` as the values and the feature names as the column names.
2. Add the prefix `Counts_` to the column names for ease of identification.
3. Concatenate this DataFrame (`cv_df`) to the original DataFrame (`speech_df`) column wise.

In [54]:
# Create a DataFrame with the feature names
cv_df = pd.DataFrame(cv_array, columns = cv.get_feature_names_out()).add_prefix('Counts_')

# Add the new columns to the original DataFrame
speech_df_new = pd.concat([speech_df, cv_df], axis=1, sort=False)
speech_df_new.head()

Unnamed: 0,Name,Inaugural Address,Date,text,text_clean,char_cnt,word_cnt,avg_word_length,Counts_abiding,Counts_ability,...,Counts_women,Counts_words,Counts_work,Counts_wrong,Counts_year,Counts_years,Counts_yet,Counts_you,Counts_young,Counts_your
0,George Washington,First Inaugural Address,"Thursday, April 30, 1789",Fellow-Citizens of the Senate and of the House...,fellow citizens of the senate and of the house...,8616,1432,6.01676,0,0,...,0,0,0,0,0,1,0,5,0,9
1,George Washington,Second Inaugural Address,"Monday, March 4, 1793",Fellow Citizens: I AM again called upon by th...,fellow citizens i am again called upon by th...,787,135,5.82963,0,0,...,0,0,0,0,0,0,0,0,0,1
2,John Adams,Inaugural Address,"Saturday, March 4, 1797","WHEN it was first perceived, in early times, t...",when it was first perceived in early times t...,13871,2323,5.971158,0,0,...,0,0,0,0,2,3,0,0,0,1
3,Thomas Jefferson,First Inaugural Address,"Wednesday, March 4, 1801",Friends and Fellow-Citizens: CALLED upon to u...,friends and fellow citizens called upon to u...,10144,1736,5.843318,0,0,...,0,0,1,2,0,0,2,7,0,7
4,Thomas Jefferson,Second Inaugural Address,"Monday, March 4, 1805","PROCEEDING, fellow-citizens, to that qualifica...",proceeding fellow citizens to that qualifica...,12902,2169,5.948363,0,0,...,0,0,0,0,2,2,2,4,0,4


With the new features combined with the original DataFrame, they can now be used for ML models or analysis.

## Tf-Idf Representation
  
While counts of occurrences of words can be a good first step towards encoding your text to build models, they have some limitations. The main issue is that counts will be much higher for very common words even when they occur across all texts, providing little value as a distinguishing feature.
  
**Introducing TF-IDF**
  
Take for example the counts of the word "the" shown here, with plentiful occurrences in every row. To limit these common words from overpowering your model some form of normalization can be used. One of the most effective approaches to do this is called "*Term Frequency Inverse Document Frequency*" or TF-IDF.
  
<img src='../_images/tf-idf-dealing-with-text-data-ml.png' text='alt text' width='500'>
  
**TF-IDF**
  
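TF-IDF multiplies the term frequency (the number of times a word occurs in a document, divided by the total number of words in that document) by the inverse document frequency (the logarithm of the total number of documents divided by the number of documents that contain the word). This has the effect of reducing the value of common words, while increasing the weight of words that do not occur in many documents.
  
$\Large \text{TF-IDF} = \frac{\text{(count of word occurrences)}}{\text{(total words in a document)}} \times \log{\left( \frac{\text{(total number of documents)}}{\text{(number of documents word is in)}} \right)}$
  
Note that scikit-learn's `TfidfVectorizer` uses a smoothed variant of this formula by default and normalizes each row to unit length, so its values differ slightly from the textbook definition.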
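
A minimal sketch of the common textbook form, term frequency times log(N / document frequency), computed by hand on a toy corpus. The documents and the helper `tf_idf` are hypothetical, and the result is not expected to match scikit-learn's output exactly.

```python
import math

# Hypothetical toy corpus, already cleaned and lower-cased
docs = [
    "the government serves the people",
    "the union endures",
    "the people work",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

def tf_idf(word, doc_tokens):
    """Textbook TF-IDF: term frequency times log(N / document frequency).

    Assumes the word occurs in at least one document of the corpus.
    """
    tf = doc_tokens.count(word) / len(doc_tokens)
    df = sum(1 for tokens in tokenized if word in tokens)
    return tf * math.log(n_docs / df)

# Scores for three words of the first document
for word in ["the", "people", "government"]:
    print(word, round(tf_idf(word, tokenized[0]), 3))
```

In this sketch "the" appears in every document, so its inverse document frequency is log(1) = 0 and its score vanishes, while "government", which appears in only one document, scores highest.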
  
**Importing the vectorizer**
  
To use a TF-IDF vectorizer, the approach is very similar to how you applied a count vectorizer.  
  
`from sklearn.feature_extraction.text import TfidfVectorizer`  
  
Then assign it to a variable name. Let's use `tv` in this case.
  
<img src='../_images/tf-idf-dealing-with-text-data-ml1.png' text='alt text' width='500'>
  
**Max features and stopwords**
  
As with `CountVectorizer`, you can limit the number of features created by specifying arguments when initializing `TfidfVectorizer`. You can set the maximum number of features with `max_features=`; a value of 100, for example, keeps only the 100 most frequent words across the corpus. You can also tell the vectorizer to omit a set of `stop_words=`, a predefined list of the most common English words such as "and" or "the". You can use scikit-learn's built-in list, load your own, or use lists provided by other Python libraries.
  
**Fitting your text**
  
Once the vectorizer has been specified you can fit it, and apply it to the text that you want to transform. Note that here we are fitting and transforming the train data, a subset of the original data.
  
<img src='../_images/tf-idf-dealing-with-text-data-ml2.png' text='alt text' width='500'>
  
**Putting it all together**
  
As before, you combine the TF-IDF values along with the feature names in a DataFrame as shown here.
  
<img src='../_images/tf-idf-dealing-with-text-data-ml3.png' text='alt text' width='500'>
  
**Inspecting your transforms**
  
After transforming your data you should always check how the different words are being valued, and see which words are receiving the highest scores through the process. This will help you understand if the features being generated make sense or not. One ad-hoc method is to isolate a single row of the transformed DataFrame (`tv_df` in this case), using the `.iloc[]` accessor, and then sorting the values in the row in descending order as shown here. These top ranked values make sense for the text of a presidential speech.
  
<img src='../_images/tf-idf-dealing-with-text-data-ml4.png' text='alt text' width='500'>
  
**Applying the vectorizer to new data**
  
So how do you apply this transformation on the test set? As mentioned before, you should preprocess your test data using the transformations made on the train data only. To ensure that the same features are created you should use the same vectorizer that you fit on the training data. So first transform the test data using the `tv` vectorizer and then recreate the test dataset by combining the TF-IDF values, feature names, and other columns.
  
<img src='../_images/tf-idf-dealing-with-text-data-ml5.png' text='alt text' width='500'>

### Tf-idf
  
While counts of occurrences of words can be useful to build models, words that occur many times may skew the results undesirably. To limit these common words from overpowering your model a form of normalization can be used. In this lesson you will be using *Term frequency-inverse document frequency* (Tf-idf) as discussed above.  
  
**Tf-idf has the effect of reducing the value of common words, while increasing the weight of words that do not occur in many documents.**
  
1. `from sklearn.feature_extraction.text import TfidfVectorizer`
2. Instantiate `TfidfVectorizer` while limiting the number of features to 100 and removing English stop words.
3. Fit and apply the vectorizer on `text_clean` column in one step.
4. Create a DataFrame `tv_df` containing the weights of the words and the feature names as the column names.

In [55]:
from sklearn.feature_extraction.text import TfidfVectorizer


# Instantiate TfidfVectorizer
tv = TfidfVectorizer(max_features=100, stop_words='english')  # Keep the 100 most frequent words after removing English stop words

# Fit the vectorizer and transform the data
tv_transformed = tv.fit_transform(speech_df['text_clean'])

# Create a DataFrame with these features
tv_df = pd.DataFrame(tv_transformed.toarray(), columns=tv.get_feature_names_out()).add_prefix('TFIDF_')
tv_df.head()

Unnamed: 0,TFIDF_action,TFIDF_administration,TFIDF_america,TFIDF_american,TFIDF_americans,TFIDF_believe,TFIDF_best,TFIDF_better,TFIDF_change,TFIDF_citizens,...,TFIDF_things,TFIDF_time,TFIDF_today,TFIDF_union,TFIDF_united,TFIDF_war,TFIDF_way,TFIDF_work,TFIDF_world,TFIDF_years
0,0.0,0.133415,0.0,0.105388,0.0,0.0,0.0,0.0,0.0,0.229644,...,0.0,0.045929,0.0,0.136012,0.203593,0.0,0.060755,0.0,0.045929,0.052694
1,0.0,0.261016,0.266097,0.0,0.0,0.0,0.0,0.0,0.0,0.179712,...,0.0,0.0,0.0,0.0,0.199157,0.0,0.0,0.0,0.0,0.0
2,0.0,0.092436,0.157058,0.073018,0.0,0.0,0.026112,0.06046,0.0,0.106072,...,0.03203,0.021214,0.0,0.062823,0.070529,0.024339,0.0,0.0,0.063643,0.073018
3,0.0,0.092693,0.0,0.0,0.0,0.090942,0.117831,0.045471,0.053335,0.223369,...,0.048179,0.0,0.0,0.094497,0.0,0.03661,0.0,0.039277,0.095729,0.0
4,0.041334,0.039761,0.0,0.031408,0.0,0.0,0.067393,0.039011,0.091514,0.27376,...,0.082667,0.164256,0.0,0.121605,0.030338,0.094225,0.0,0.0,0.054752,0.062817


Did you notice that counting the word occurrences and calculating the Tf-idf weights follow very similar steps? This consistent API is one of the reasons scikit-learn is so popular.

### Inspecting Tf-idf values
  
After creating Tf-idf features you will often want to understand which words score highest in each text. This can be achieved by isolating the row you want to examine and then sorting the scores from high to low.
  
The DataFrame from the last exercise (`tv_df`) is available in your workspace.
  
1. Assign the first row of `tv_df` to `sample_row`.
2. `sample_row` is now a series of weights assigned to words. Sort these values to print the top 5 highest-rated words.

In [56]:
# Isolate the row to be examined
sample_row = tv_df.iloc[0]  # .iloc[] goes by rows initially, and first row is index=0

# Print the top 5 words of the sorted output
print(sample_row.sort_values(ascending=False).head())

TFIDF_government    0.367430
TFIDF_public        0.333237
TFIDF_present       0.315182
TFIDF_duty          0.238637
TFIDF_country       0.229644
Name: 0, dtype: float64


Do you think these scores make sense for the corresponding words?
  
Yes. The corpus consists of presidential inaugural speeches, so it makes sense for words like 'government', 'public', and 'country' to receive high weights. Note that these values (between roughly 0.23 and 0.37) are normalized Tf-idf weights, not occurrence percentages.

### Transforming unseen data
  
When creating vectors from text, any transformations that you perform before training a machine learning model must also be applied to the new unseen (test) data. To achieve this, follow the same approach as in the last chapter: *fit the vectorizer only on the training data, and apply it to the test data*.
  
For this exercise the `speech_df` DataFrame has been split in two:
  
`train_speech_df` : The training set consisting of the first 45 speeches.  
`test_speech_df` : The test set consisting of the remaining speeches.  
  
1. Instantiate `TfidfVectorizer`.
2. Fit the vectorizer and apply it to the `text_clean` column of the training data.
3. Apply the same vectorizer on the `text_clean` column of the test data.
4. Create a DataFrame of these new features from the test set.

In [57]:
# Train/test split
train_speech_df = speech_df.iloc[:45]   # First 45 speeches, all rows up to 45th, 0-based indexing (0 to 44)
test_speech_df = speech_df.iloc[45:]    # Starting at the 45th index take the remaining rows from there

In [58]:
from sklearn.feature_extraction.text import TfidfVectorizer


# Instantiate TfidfVectorizer
tv = TfidfVectorizer(max_features=100, stop_words='english')

# Fit the vectorizer to the training data and transform it
tv_transformed = tv.fit_transform(train_speech_df['text_clean'])

# Transform test data
test_tv_transformed = tv.transform(test_speech_df['text_clean'])

# Create new features for the test set
test_tv_df = pd.DataFrame(test_tv_transformed.toarray(), columns=tv.get_feature_names_out()).add_prefix('TFIDF_')
test_tv_df.head()

Unnamed: 0,TFIDF_action,TFIDF_administration,TFIDF_america,TFIDF_american,TFIDF_authority,TFIDF_best,TFIDF_business,TFIDF_citizens,TFIDF_commerce,TFIDF_common,...,TFIDF_subject,TFIDF_support,TFIDF_time,TFIDF_union,TFIDF_united,TFIDF_war,TFIDF_way,TFIDF_work,TFIDF_world,TFIDF_years
0,0.0,0.02954,0.233954,0.082703,0.0,0.0,0.0,0.022577,0.0,0.0,...,0.0,0.0,0.115378,0.0,0.024648,0.07905,0.033313,0.0,0.299983,0.134749
1,0.0,0.0,0.547457,0.036862,0.0,0.036036,0.0,0.015094,0.0,0.0,...,0.0,0.019296,0.092567,0.0,0.0,0.052851,0.066817,0.078999,0.277701,0.126126
2,0.0,0.0,0.126987,0.134669,0.0,0.131652,0.0,0.0,0.0,0.046997,...,0.0,0.0,0.075151,0.0,0.080272,0.042907,0.054245,0.096203,0.225452,0.043884
3,0.037094,0.067428,0.267012,0.031463,0.03999,0.061516,0.050085,0.077301,0.0,0.0,...,0.0,0.098819,0.21069,0.0,0.056262,0.030073,0.03802,0.235998,0.237026,0.061516
4,0.0,0.0,0.221561,0.156644,0.028442,0.087505,0.0,0.109959,0.0,0.023428,...,0.0,0.023428,0.187313,0.131913,0.040016,0.021389,0.081124,0.119894,0.299701,0.153133


Correct, the vectorizer should only be fit on the train set, never on your test set.

## Bag of words and N-grams
  
So far you have looked at individual words on their own without any context or word order. This approach is called a bag-of-words model, as the words are treated as if they are being drawn from a bag with no concept of order or grammar. While analyzing the occurrences of individual words can be a valuable way to create features from a piece of text, you will notice that individual words can lose all their context and meaning when viewed independently.
  
**Issues with bag of words**
  
Take for example the word 'happy' shown here. One would assume it was used in a positive context, but if in reality it was used in the phrase 'not happy' this assumption would be incorrect. Similarly, if the phrase was extended to 'never not happy' the connotation changes again. One common method to retain at least some concept of word order in a text is to instead use multiple consecutive words, such as pairs (bi-grams) or three consecutive words (tri-grams). This maintains at least some ordering information while at the same time allowing for the creation of a reasonable set of features.
  
<img src='../_images/issues-with-bag-of-words-n-grams.png' text='alt text' width='500'>
  
**Using N-grams**
  
To leverage n-grams in your own models, an additional argument, `ngram_range`, can be specified when instantiating your TF-IDF vectorizer. The values assigned to the argument are the minimum and maximum length of n-grams to be included. In this case you would only be looking at bi-grams (n-grams with two words). Printing the bi-gram features created, we can see pairs of words instead of single words.
  
<img src='../_images/issues-with-bag-of-words-n-grams2.png' text='alt text' width='500'>
  
**Finding common words**
  
As mentioned in the last video, when creating new features, you should always take time to check your work and ensure that the features you are creating make sense. A good way to check your n-grams is to see which values occur most often. This can be done by applying the `.sum()` method to the DataFrame of count values that you created, which adds up the counts for each n-gram across all texts.
  
<img src='../_images/issues-with-bag-of-words-n-grams3.png' text='alt text' width='500'>
  
**Finding common words**
  
After sorting the values in descending order, you can see the most commonly occurring values. It comes as no surprise that the most commonly occurring bi-gram in a dataset of US presidents' speeches is "United States", which indicates that the features being created make sense.
  
<img src='../_images/issues-with-bag-of-words-n-grams4.png' text='alt text' width='500'>

### Using longer n-grams
  
So far you have created features based on individual words in each of the texts. This can be quite powerful when used in a machine learning model, but you may be concerned that by looking at words individually a lot of the context is being ignored. To deal with this when creating models, you can use n-grams, which are sequences of $n$ words grouped together. 
  
For example:
  
**bigrams**: Sequences of two consecutive words  
**trigrams**: Sequences of three consecutive words  
  
These can be automatically created in your dataset by specifying the `ngram_range=` argument as a tuple (`n1`, `n2`) where all n-grams in the `n1` to `n2` range are included.
  
1. `from sklearn.feature_extraction.text import CountVectorizer`
2. Instantiate `CountVectorizer` while considering only trigrams.
3. Fit the vectorizer and apply it to the `text_clean` column in one step.
4. Print the feature names generated by the vectorizer.

In [59]:
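# Import CountVectorizer (step 1 above)
from sklearn.feature_extraction.text import CountVectorizer

# Instantiate a trigram vectorizer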
cv_trigram_vec = CountVectorizer(
    max_features=100, 
    stop_words='english', 
    ngram_range=(3, 3)
)

# Fit and apply trigram vectorizer
cv_trigram = cv_trigram_vec.fit_transform(speech_df['text_clean'])

# Print the trigram features
cv_trigram_vec.get_feature_names_out()[:10]

array(['ability preserve protect', 'agriculture commerce manufactures',
       'america ideal freedom', 'amity mutual concession',
       'anchor peace home', 'ask bow heads', 'best ability preserve',
       'best interests country', 'bless god bless', 'bless united states'],
      dtype=object)

Here you can see that by taking sequences of three consecutive words, some context is preserved.

### Finding the most common words
  
It's always advisable, once you have created your features, to inspect them to ensure that they are as you would expect. This will allow you to catch errors early, and may influence what further feature engineering you need to do.
  
The vectorizer (`cv_trigram_vec`) you fit in the last exercise and the sparse array consisting of word counts (`cv_trigram`) are available in your workspace.
  
1. Create a DataFrame of the features (word counts).
2. Add the counts of word occurrences and print the top 5 most occurring words.

In [60]:
# Create a DataFrame of the features
cv_tri_df = pd.DataFrame(cv_trigram.toarray(), columns=cv_trigram_vec.get_feature_names_out()).add_prefix('Counts_')

# Print the top 5 words in the sorted descending output
cv_tri_df.sum().sort_values(ascending=False).head()

Counts_constitution united states    20
Counts_people united states          13
Counts_mr chief justice              10
Counts_preserve protect defend       10
Counts_president united states        8
dtype: int64

Great! The most common trigram being "constitution united states" makes a lot of sense for US presidents' speeches.