# Data-Sitters Club 8: Just the Code

This notebook contains just the code (and a little bit of text) from the portions of *[DSC 8: Text-Comparison-Algorithm-Crazy-Quinn](https://datasittersclub.github.io/site/dsc8/)* for using Euclidean and cosine distance with word counts and word frequencies, and running TF-IDF for your texts.

This code assumes you've actually read the Data-Sitters Club book already. There are lots of pitfalls if you just apply the code without understanding what it does or how the different options affect the results. Read first, then try!

## Load modules

In [None]:
#Installs seaborn
#You only need to run this cell the first time you run this notebook
import sys
!{sys.executable} -m pip install seaborn

In [None]:
#Imports the count vectorizer from Scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
#Glob is used for finding path names
import glob
#pdist calculates the pairwise distances, and squareform arranges them into a square matrix
from scipy.spatial.distance import pdist, squareform
#We need os to change the working directory to the corpus folder
import os
import numpy as np
#We need pandas to put the distances into labeled dataframes
import pandas as pd
#Import matplotlib
import matplotlib.pyplot as plt
#Import seaborn
import seaborn as sns

## Set the file directory for your corpus

In [None]:
filedir = '/Users/qad/Documents/dsc_corpus_clean'
os.chdir(filedir)

# Word count vectorizer
This looks at just the top 1000 words, and doesn't use `max_df` to remove words that occur in more than a given share of your texts. To add it, put it between the `input` and `max_features` parameters, separated by commas (e.g. `input="filename", max_df=.7, max_features=1000`, which drops words found in more than 70% of the texts).

In [None]:
# Use the glob library to create a list of file names, sorted alphabetically
# Alphabetical sorting will get us the books in numerical order
filenames = sorted(glob.glob("*.txt"))
# Parse those filenames to create a list of file keys (ID numbers)
# You'll use these later on.
filekeys = [f.split('/')[-1].split('.')[0] for f in filenames]

# Create a CountVectorizer instance with the parameters you need
wordcountvectorizer = CountVectorizer(input="filename", max_features=1000)
# Run the vectorizer on your list of filenames to create your wordcounts
# The result is a sparse matrix; convert it with toarray() before handing it to SciPy
wordcounts = wordcountvectorizer.fit_transform(filenames)
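
If you want to see what `max_df` actually changes before running it on the whole corpus, here's a small sketch on three made-up sentences. The sentences and the variable names (`toy_docs`, `all_words`, `fewer_words`) are invented for illustration and aren't part of the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three tiny made-up "books" (illustrative only, not from the corpus)
toy_docs = [
    "kristy called the club meeting",
    "claudia hid the candy",
    "stacey counted the club dues",
]

# Without max_df, every word is kept
all_words = CountVectorizer().fit(toy_docs)
# With max_df=.7, words found in more than 70% of the texts are dropped
fewer_words = CountVectorizer(max_df=.7).fit(toy_docs)

# Words that max_df removed from the vocabulary
dropped = sorted(set(all_words.vocabulary_) - set(fewer_words.vocabulary_))
print(dropped)  # ['the']
```

Here "the" is in all three texts, so it goes away, while "club" is in two out of three (about 67%) and stays.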

### Bonus: word count toy
The code below will display all the words that were included in the word count vectorizer, based on the parameters you've set.

In [None]:
sum_words = wordcounts.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in wordcountvectorizer.vocabulary_.items()]
sorted(words_freq, key = lambda x: x[1], reverse=True)

## Euclidean distance for word count vectorizer

In [None]:
#Runs the Euclidean distance calculation and displays the output
#toarray() converts the sparse word counts into a dense array that SciPy's pdist accepts
euclidean_distances = pd.DataFrame(squareform(pdist(wordcounts.toarray())), index=filekeys, columns=filekeys)
euclidean_distances
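
To make the `pdist` and `squareform` steps less of a black box, here's a minimal sketch with made-up numbers. The counts and the labels `a`, `b`, `c` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Made-up counts for three texts over two words (illustrative only)
toy_counts = np.array([
    [0, 0],
    [3, 4],
    [6, 8],
])
toy_keys = ["a", "b", "c"]

# pdist returns one distance per pair of rows, in the order (a,b), (a,c), (b,c)
condensed = pdist(toy_counts, metric="euclidean")
# squareform expands that into a symmetric matrix with zeros on the diagonal
toy_euclidean = pd.DataFrame(squareform(condensed), index=toy_keys, columns=toy_keys)
print(toy_euclidean)
```

The three pairwise distances come out as 5, 10, and 5, and the dataframe just shows the same numbers with row and column labels, the way the real one does with your file keys.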

### Euclidean distance visualization

In [None]:
#Defines the size of the image
plt.figure(figsize=(100, 100))
#Increases the label size so it's more legible
sns.set(font_scale=3)
#Generates the visualization using the data in the dataframe
ax = sns.heatmap(euclidean_distances)
#Displays the image
plt.show()

## Cosine distance for word count vectorizer

In [None]:
#toarray() converts the sparse word counts into a dense array that SciPy's pdist accepts
cosine_distances = pd.DataFrame(squareform(pdist(wordcounts.toarray(), metric='cosine')), index=filekeys, columns=filekeys)
cosine_distances
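
Here's a rough sketch of how cosine distance differs from Euclidean distance when texts have different lengths. The counts are made up: the second "text" is just the first one repeated three times.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Made-up counts (illustrative only); row 1 is row 0 multiplied by three
toy_counts = np.array([
    [2, 1, 0],
    [6, 3, 0],
    [0, 1, 2],
])

# Pair order is (0,1), (0,2), (1,2)
toy_euclid = pdist(toy_counts, metric="euclidean")
toy_cosine = pdist(toy_counts, metric="cosine")
print(toy_euclid)
print(toy_cosine)
```

Euclidean distance treats the first two rows as far apart because one is longer, while their cosine distance is (essentially) zero because the proportions of words are identical.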

### Cosine distance visualization

In [None]:
#Defines the size of the image
plt.figure(figsize=(100, 100))
#Increases the label size so it's more legible
sns.set(font_scale=3)
#Generates the visualization using the data in the dataframe
ax = sns.heatmap(cosine_distances)
#Displays the image
plt.show()

# Term frequency vectorizer

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Use the glob library to create a list of file names, sorted alphabetically
# Alphabetical sorting will get us the books in numerical order
filenames = sorted(glob.glob("*.txt"))
# Parse those filenames to create a list of file keys (ID numbers)
# You'll use these later on.
filekeys = [f.split('/')[-1].split('.')[0] for f in filenames]

# Create a TfidfVectorizer instance with the parameters you need
# use_idf=False with norm='l1' gives word frequencies: each text's values are scaled to sum to 1
freqvectorizer = TfidfVectorizer(input="filename", stop_words=None, use_idf=False, norm='l1', max_features=1000)
# Run the vectorizer on your list of filenames to create your word frequencies
# Use the toarray() function so that SciPy will accept the results
wordfreqs = freqvectorizer.fit_transform(filenames).toarray()
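
As a quick sanity check on what `use_idf=False` and `norm='l1'` do together, here's a small sketch comparing raw counts with frequencies. The two sentences are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Two made-up texts of different lengths (illustrative only)
toy_docs = [
    "club club meeting",
    "club meeting meeting meeting meeting meeting",
]

# Columns are in alphabetical order: club, meeting
toy_counts = CountVectorizer().fit_transform(toy_docs).toarray()
toy_freqs = TfidfVectorizer(use_idf=False, norm="l1").fit_transform(toy_docs).toarray()

print(toy_counts)  # raw counts per text
print(toy_freqs)   # each row divided by its own total, so every row sums to 1
```

The first text is two-thirds "club", the second is five-sixths "meeting", and the difference in length between the two no longer matters.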

## Euclidean distance for term frequency vectorizer

In [None]:
euclidean_distances_freq = pd.DataFrame(squareform(pdist(wordfreqs, metric='euclidean')), index=filekeys, columns=filekeys)
euclidean_distances_freq

### Euclidean distance visualization

In [None]:
#Defines the size of the image
plt.figure(figsize=(100, 100))
#Increases the label size so it's more legible
sns.set(font_scale=3)
#Generates the visualization using the data in the dataframe
ax = sns.heatmap(euclidean_distances_freq)
#Displays the image
plt.show()

## Cosine distance for term frequency vectorizer

In [None]:
cosine_distances_freq = pd.DataFrame(squareform(pdist(wordfreqs, metric='cosine')), index=filekeys, columns=filekeys)
cosine_distances_freq

### Cosine distance visualization

In [None]:
#Defines the size of the image
plt.figure(figsize=(100, 100))
#Increases the label size so it's more legible
sns.set(font_scale=3)
#Generates the visualization using the data in the dataframe
ax = sns.heatmap(cosine_distances_freq)
#Displays the image
plt.show()

# TF-IDF

In [None]:
# Use the glob library to create a list of file names, sorted alphabetically
# Alphabetical sorting will get us the books in numerical order
filenames = sorted(glob.glob("*.txt"))
# Parse those filenames to create a list of file keys (ID numbers)
# You'll use these later on.
filekeys = [f.split('/')[-1].split('.')[0] for f in filenames]

# Create a TfidfVectorizer instance with the parameters you need
vectorizer = TfidfVectorizer(input="filename", stop_words=None, use_idf=True, norm=None, max_features=1000, max_df=.95)
# Run the vectorizer on your list of filenames to create your TF-IDF scores
# Use the toarray() function to get a regular array you can loop over
transformed_documents = vectorizer.fit_transform(filenames)
transformed_documents_as_array = transformed_documents.toarray()
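
If you're curious where the TF-IDF numbers come from, here's a minimal sketch that recomputes them by hand on three made-up sentences. It assumes scikit-learn's default smoothed IDF and leaves out `max_df` and `max_features` so the toy vocabulary stays intact; the names `toy_docs` and `idf_by_hand` are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Three made-up texts (illustrative only)
toy_docs = [
    "club club candy",
    "club meeting",
    "club dues",
]

toy_counts = CountVectorizer().fit_transform(toy_docs).toarray()
toy_vectorizer = TfidfVectorizer(use_idf=True, norm=None)
toy_tfidf = toy_vectorizer.fit_transform(toy_docs).toarray()

# Document frequency: in how many texts each word appears
doc_freq = (toy_counts > 0).sum(axis=0)
n_docs = len(toy_docs)
# Scikit-learn's default (smoothed) IDF: ln((1 + n) / (1 + df)) + 1
idf_by_hand = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# With norm=None, each score is simply count * idf
print(np.allclose(toy_tfidf, toy_counts * idf_by_hand))  # True
```

A word like "club" that shows up in every text gets an IDF of 1, so its score is just its count, while rarer words get a boost.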

Create a CSV for each text file, with its terms sorted from most to least distinctive (highest TF-IDF score first).

In [None]:
# construct a list of output file paths from the list of text files, swapping the .txt extension for .csv
output_filenames = [str(txt_file).replace(".txt", ".csv") for txt_file in filenames]

# loop each item in transformed_documents_as_array, using enumerate to keep track of the current position
for counter, doc in enumerate(transformed_documents_as_array):
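    # construct a dataframe pairing each term with its score for this text
    # (get_feature_names_out() needs scikit-learn 1.0+; older versions use get_feature_names())
    tf_idf_tuples = list(zip(vectorizer.get_feature_names_out(), doc))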
    one_doc_as_df = pd.DataFrame.from_records(tf_idf_tuples, columns=['term', 'score']).sort_values(by='score', ascending=False).reset_index(drop=True)

    # output to a csv using the enumerated value for the filename
    one_doc_as_df.to_csv(output_filenames[counter])

## Suggested Citation

Dombrowski, Quinn. “DSC #8: Just the Code.” Jupyter Notebook. *The Data-Sitters Club*, October 21, 2020. https://github.com/datasittersclub/dsc8.