# Introduction to Text Mining Part 1 - Exercises with Answers

## Exercise 1

#### Task 1
##### Import the required packages.
##### Set `main_dir` to the location of your `booz-allen-hamilton` folder.
##### Build `data_dir` by appending the rest of the path to the data directory to `main_dir`.
##### Set the working directory to `data_dir`.
##### Check that the working directory has been updated to `data_dir`.

#### Result:

In [1]:
# Helper packages.
import os
import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt

# Packages with tools for text processing.
import nltk
import nltk.data
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

In [6]:
from pathlib import Path
# Set `home_dir` to the current user's home directory.
home_dir = Path.home()

# Set `main_dir` to the location of your `booz-allen-hamilton` folder.
main_dir = home_dir / "Documents" / "NLP_Intro" / "intro-to-text-mining-main"
data_dir = main_dir / "Data"
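
Because `main_dir` and `data_dir` are `pathlib.Path` objects, further path pieces are joined with the `/` operator rather than with string concatenation. The following is a minimal sketch of that behavior; the folder name `project` is a hypothetical placeholder, not the course folder.

```python
from pathlib import Path

# Hypothetical folder names, used only for illustration.
main_dir = Path("project")
data_dir = main_dir / "Data"

# `/` joins a Path with a string and returns a new Path.
csv_path = data_dir / "UN_agreement_titles.csv"
print(csv_path.name)  # UN_agreement_titles.csv

# `+` is not defined between a Path and a string.
try:
    data_dir + "/UN_agreement_titles.csv"
    plus_works = True
except TypeError:
    plus_works = False
print(plus_works)  # False
```

If a plain string is needed, `str(csv_path)` converts the path.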

In [7]:
# Change the working directory.
os.chdir(data_dir)

# Check the working directory.
print(os.getcwd())

/Users/amirmokhtari/Documents/NLP_Intro/intro-to-text-mining-main/data


In [8]:
import nltk
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/amirmokhtari/nltk_data...


True

#### Task 2
##### Load the corpus from `UN_agreement_titles.csv` into a new variable `agreements`.
##### Print the columns of `agreements`.
##### Print the first 5 rows and check the output to see if the data is loaded correctly.

#### Result:

In [9]:
# Load corpus from a CSV file.
# `data_dir` is a `Path`, so join the file name with `/` instead of `+`.
agreements = pd.read_csv(data_dir / 'UN_agreement_titles.csv')

In [10]:
# Print the columns of `agreements`.
print(agreements.columns)

In [11]:
# Print the first 5 rows.
print(agreements.head())

#### Task 3
##### Make a series from the dataframe that contains only the `title` column of `agreements` and name it `titles`.
##### Print the first 5 titles. 

#### Result:

In [None]:
# Create a series from the dataframe, name it `titles`.
titles = agreements["title"]
print(titles[:5])

## Exercise 2

#### Task 1
##### Tokenize each title in the series `titles` and assign it to `titles_tokenized`.
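##### Assign the first tokenized title to `titles_words` and print it.

##### Note: If you run into a `LookupError` while using `word_tokenize`, download `punkt` from NLTK using the code below. Newer NLTK releases may ask for `punkt_tab` instead; the error message names the resource to download.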

In [None]:
nltk.download('punkt')

#### Result:

In [None]:
# Tokenize each title into a large list of tokenized titles.
titles_tokenized = [word_tokenize(title) for title in titles]

# First tokenized title.
titles_words = titles_tokenized[0]
print(titles_words)

#### Task 2
##### Clean the `titles_words` in the following order:
##### 1. Convert all characters to lower case and assign it to `titles_words`.
##### 2. Remove stop words from `titles_words` and assign it to `titles_words`.
##### 3. Remove punctuation, numbers, and all other symbols that are not letters of the alphabet 
#####    from `titles_words` and assign it to `titles_words`.
##### 4. Stem words in `titles_words` and assign it to `titles_words`.
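
The order of the steps matters: NLTK's English stop word list is lower case, so lower-casing has to happen before stop word removal. The sketch below runs steps 1 to 3 on a toy token list using only the standard library. The tokens and the three-word stop list are hypothetical stand-ins for `titles_words` and `stopwords.words('english')`, and stemming is left out because it needs the NLTK stemmer.

```python
# Toy tokens and a tiny hand-picked stop word set (both hypothetical).
tokens = ["Convention", "on", "the", "Rights", "of", "the", "Child", ",", "1989"]
toy_stop_words = {"on", "the", "of"}

# 1. Convert to lower case.
tokens = [word.lower() for word in tokens]

# 2. Remove stop words.
tokens = [word for word in tokens if word not in toy_stop_words]

# 3. Keep only tokens made up entirely of letters.
tokens = [word for word in tokens if word.isalpha()]

print(tokens)  # ['convention', 'rights', 'child']
```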

##### Note: If you run into a `LookupError` while using `stopwords`, download `stopwords` from NLTK using the code below.

In [None]:
nltk.download('stopwords')

#### Result:

In [None]:
# 1. Convert to lower case.
titles_words = [word.lower() for word in titles_words]
print(titles_words[:10])

In [None]:
# 2. Remove stop words.
# Get common English stop words.

stop_words = stopwords.words('english')
print(stop_words[:10])

In [None]:
# Remove stop words.
titles_words = [word for word in titles_words if word not in stop_words]
print(titles_words[:10])

In [None]:
# 3. Remove punctuation and any non-alphabetical characters.
titles_words = [word for word in titles_words if word.isalpha()]
print(titles_words[:10])

In [None]:
# 4. Stem words.
titles_words = [PorterStemmer().stem(word) for word in titles_words]
print(titles_words[:10])

#### Task 3
##### Create a placeholder list `titles_clean` with the same length as `titles_tokenized`.
##### Perform the above steps on every title in `titles_tokenized` and record the length of each clean title in `word_counts_per_titles`.
##### Check the first 10 words in the 300th title.

#### Result:

In [None]:
# Create a list to hold the clean titles.
titles_clean = [None] * len(titles_tokenized)

# Create a list to hold the word count of each clean title.
word_counts_per_titles = [None] * len(titles_tokenized)

# Create the stemmer once and reuse it for every word.
stemmer = PorterStemmer()

# Process words in all documents.
for i in range(len(titles_tokenized)):
    # 1. Convert to lower case.
    titles_clean[i] = [word.lower() for word in titles_tokenized[i]]

    # 2. Remove stop words.
    titles_clean[i] = [word for word in titles_clean[i] if word not in stop_words]

    # 3. Remove punctuation and any non-alphabetical characters.
    titles_clean[i] = [word for word in titles_clean[i] if word.isalpha()]

    # 4. Stem words.
    titles_clean[i] = [stemmer.stem(word) for word in titles_clean[i]]

    # Record the word count per title.
    word_counts_per_titles[i] = len(titles_clean[i])

In [None]:
# First 10 words in the 300th title.
# Index will be 299 since the first row has index of 0.
print(titles_clean[299][:10])

#### Task 4
##### Print the first 10 entries of `word_counts_per_titles`.
##### Plot a histogram of `word_counts_per_titles` and set the number of bins to the number of unique values in the list.

#### Result:

In [None]:
# Let's take a look at total word counts per title (for the first 10).
print(word_counts_per_titles[:10])

In [None]:
# Build the histogram.
plt.hist(word_counts_per_titles, bins = len(set(word_counts_per_titles)))
plt.xlabel('Number of words per title')
plt.ylabel('Frequency')
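
The bin count above comes from `len(set(...))`, the number of distinct word counts. As a rough cross-check of what the histogram should show, the same distribution can be tabulated with `collections.Counter`. The sketch below uses a hypothetical list of counts rather than the real `word_counts_per_titles`.

```python
from collections import Counter

# Hypothetical word counts for seven titles.
toy_counts = [3, 5, 3, 8, 5, 3, 12]

# Number of distinct values, used as the number of histogram bins.
n_bins = len(set(toy_counts))
print(n_bins)  # 4

# How many titles have each word count.
frequency = Counter(toy_counts)
print(sorted(frequency.items()))  # [(3, 3), (5, 2), (8, 1), (12, 1)]
```

Note that an integer `bins` value splits the range from the smallest to the largest count into equal-width intervals, so when the counts have gaps a bin does not necessarily correspond to exactly one value.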

#### Task 5
##### Convert the word counts list and the clean titles list to numpy arrays named `ex_word_counts_array` and `titles_array`, and print the length of `titles_array`.
##### Find the indices of all titles that contain at least 3 words and save them to `valid_titles`. Print the length of `valid_titles`.
##### Subset `titles_array` to keep only the titles with at least 3 words. Print the length of `titles_array`.
##### Convert it back to a list `titles_clean`. Print the first 5 rows of `titles_clean`.
##### Combine the word tokens in each title into a single string and save the result as a list called `titles_clean_list`. Print the first 5 titles in `titles_clean_list`.

#### Result:

In [None]:
# Array with the length of each title.
ex_word_counts_array = np.array(word_counts_per_titles)
# Titles differ in length, so store them as an object array of lists.
titles_array = np.array(titles_clean, dtype=object)
print(len(titles_array))

In [None]:
# Find indices of all titles where there are at least 3 words.
valid_titles = np.where(ex_word_counts_array >= 3)[0]
print(len(valid_titles))

In [None]:
# Subset the titles array to keep only those where there are at least 3 words.
titles_array = titles_array[valid_titles]
print(len(titles_array))

In [None]:
# Convert the array back to a list.
titles_clean = titles_array.tolist()
print(titles_clean[:5])

In [None]:
# Join words in each title into a single character string.
titles_clean_list = [' '.join(title) for title in titles_clean]
print(titles_clean_list[:5])
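
The whole filtering sequence can be tried on a small example first. The sketch below uses three hypothetical tokenized titles. Since the inner lists have different lengths, the array is built with `dtype=object`, which recent NumPy versions require for ragged input.

```python
import numpy as np

# Hypothetical clean titles of different lengths.
toy_titles = [
    ["trade", "agreement", "tariff"],
    ["protocol"],
    ["loan", "agreement", "develop", "bank"],
]

toy_counts = np.array([len(title) for title in toy_titles])
toy_array = np.array(toy_titles, dtype=object)  # 1-D array of lists

# Indices of titles with at least 3 words.
keep = np.where(toy_counts >= 3)[0]
print(keep.tolist())  # [0, 2]

# Subset, convert back to a list, and join the tokens.
kept = toy_array[keep].tolist()
joined = [" ".join(title) for title in kept]
print(joined)  # ['trade agreement tariff', 'loan agreement develop bank']
```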

#### Task 6
##### Use the function we defined in class that takes a list of character strings and the name of an output file, and writes the strings to a text file.

In [None]:
# Define function.
def write_lines(lines, filename):      #<- given lines to write and filename
    joined_lines = '\n'.join(lines)    #<- join lines with line breaks
    with open(filename, 'w') as file:  #<- open the given file for writing
        file.write(joined_lines)       #<- write lines; file closes on exit

##### Save output file name to a variable `ex_out_filename` and call the text file "ex_clean_titles.txt".

#### Result:

In [None]:
# Save file name to a variable.
ex_out_filename = "ex_clean_titles.txt"

In [None]:
# Write sequences to file.
write_lines(titles_clean_list, ex_out_filename)
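
One way to confirm that the file was written as expected is to read it back and compare it with the original list. The sketch below does this in a temporary directory with a stand-in function, `write_lines_sketch`, and hypothetical lines, so it does not touch `ex_clean_titles.txt`.

```python
import os
import tempfile

def write_lines_sketch(lines, filename):
    # Stand-in for the class function: one string per line.
    with open(filename, "w") as file:
        file.write("\n".join(lines))

# Hypothetical clean titles.
toy_lines = ["trade agreement tariff", "loan agreement bank"]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy_clean_titles.txt")
    write_lines_sketch(toy_lines, out_path)
    # Read the file back and split it into lines again.
    with open(out_path) as file:
        read_back = file.read().splitlines()

print(read_back == toy_lines)  # True
```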

## Exercise 3

#### Task 1
##### Create a `CountVectorizer()` and save it as `ex_vec`.
##### Create a DTM of the `titles_clean_list` and name it `ex_X`.
##### Convert `ex_X` to an array.
##### Print the  first 20 feature names of `ex_vec`.
##### Convert `ex_X` to a pandas dataframe `ex_DTM` and print the top 5 lines.

#### Result:

In [None]:
ex_vec = CountVectorizer()
ex_X = ex_vec.fit_transform(titles_clean_list)
print(ex_X.toarray())

In [None]:
# `get_feature_names_out()` requires scikit-learn 1.0 or later.
print(ex_vec.get_feature_names_out()[:20])

In [None]:
# Convert the matrix into a pandas dataframe for easier manipulation.
ex_DTM = pd.DataFrame(ex_X.toarray(), columns = ex_vec.get_feature_names_out())
print(ex_DTM.head())
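
To see what a document-term matrix holds, it can help to build a tiny one by hand: one row per document, one column per vocabulary word, and each cell the number of times that word occurs in that document. The sketch below does this with the standard library on three hypothetical documents. It only illustrates the idea and does not reproduce `CountVectorizer`'s own tokenization rules.

```python
from collections import Counter

# Hypothetical clean titles, already joined into strings.
toy_docs = ["trade agreement tariff", "loan agreement", "trade trade bank"]

# Vocabulary: every distinct word, in alphabetical order.
vocabulary = sorted({word for doc in toy_docs for word in doc.split()})
print(vocabulary)  # ['agreement', 'bank', 'loan', 'tariff', 'trade']

# One row of counts per document, one column per vocabulary word.
dtm = []
for doc in toy_docs:
    counts = Counter(doc.split())
    dtm.append([counts[word] for word in vocabulary])

for row in dtm:
    print(row)
# [1, 0, 0, 1, 1]
# [1, 0, 1, 0, 0]
# [0, 1, 0, 0, 2]
```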

#### Task 2
##### Use the convenience function we defined in class that sorts a dictionary by value and returns its first n entries.

In [None]:
def HeadDict(dict_x, n):
    # Get items from the dictionary and sort them by
    # value in descending (i.e. reverse) order.
    sorted_x = sorted(dict_x.items(),
                      reverse = True,
                      key = lambda kv: kv[1])
    # Keep the first `n` (key, value) pairs and
    # turn them back into a dictionary.
    return dict(sorted_x[:n])

##### Sum the counts of each word across all documents and save the resulting series as a dictionary `ex_corpus_freq_dist`.
##### Print the top 30 words and their counts in `ex_corpus_freq_dist`.

#### Result:

In [None]:
# Sum frequencies of each word in all documents.
print(ex_DTM.sum(axis = 0).head())

# Save series as a dictionary.
ex_corpus_freq_dist = ex_DTM.sum(axis = 0).to_dict()

# Glance at the top 30 words with highest counts.
print(HeadDict(ex_corpus_freq_dist, 30))
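
For comparison, the standard library offers a similar "top n by value" lookup through `collections.Counter.most_common`. The sketch below applies it to a hypothetical frequency dictionary. It is an alternative shown for illustration, not the function used in class.

```python
from collections import Counter

# Hypothetical corpus frequencies.
toy_freq = {"agreement": 40, "loan": 12, "trade": 25, "bank": 7}

# `most_common(n)` returns the n (word, count) pairs with the highest counts.
top_two = dict(Counter(toy_freq).most_common(2))
print(top_two)  # {'agreement': 40, 'trade': 25}
```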

#### Task 3
##### Save `ex_X`, `ex_DTM`, `ex_word_counts_array`, `valid_titles`, `titles_clean`, `titles_clean_list` and `ex_corpus_freq_dist` as pickle files named `ex_DTM_matrix`, `ex_DTM`, `ex_word_counts_array`, `valid_titles`, `ex_titles_clean`, `ex_titles_clean_list` and `ex_corpus_freq_dist` respectively, to be used in the next module.

#### Result:

In [None]:
# Map each output file name to the object it stores.
objects_to_save = {
    'ex_DTM_matrix.sav': ex_X,
    'ex_DTM.sav': ex_DTM,
    'ex_word_counts_array.sav': ex_word_counts_array,
    'valid_titles.sav': valid_titles,
    'ex_titles_clean.sav': titles_clean,
    'ex_titles_clean_list.sav': titles_clean_list,
    'ex_corpus_freq_dist.sav': ex_corpus_freq_dist,
}
# Write each object and close its file as soon as it is written.
for filename, obj in objects_to_save.items():
    with open(filename, 'wb') as file:
        pickle.dump(obj, file)
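
In the next module these files can be read back with `pickle.load`. The sketch below shows the round trip on a hypothetical dictionary in a temporary directory, so it leaves the exercise files untouched. The file name `toy_freq.sav` is a placeholder.

```python
import os
import pickle
import tempfile

# Hypothetical object to save.
toy_freq = {"agreement": 40, "trade": 25}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_freq.sav")

    # Save the object in binary mode.
    with open(path, "wb") as file:
        pickle.dump(toy_freq, file)

    # Load it back, also in binary mode.
    with open(path, "rb") as file:
        restored = pickle.load(file)

print(restored == toy_freq)  # True
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.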