### Model Lesson Preview

Vocab: 
- **TF**: Term Frequency; how often a word appears in a document.
- **IDF**: Inverse Document Frequency; a measure based on how many documents a word appears in.
- **TF-IDF**: A combination of the two measures above.
- **Raw Count**: The number of occurrences of each word.
- **Frequency**: The number of times each word appears divided by the total number of words.
- **Augmented Frequency**: The frequency of each word divided by the maximum frequency. This can help prevent bias towards larger documents.


In [1]:
#imports 
from pprint import pprint

import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt

from prepare import basic_clean, lemmatize

In [2]:
#acquire text
document = 'Mary had a little lamb, a little lamb, a little lamb.'

# clean up the text
document = document.lower().replace(',', '').replace('.', '')

# transform into a series
words = pd.Series(document.split())

# From the Series we can extract the value_counts, which is our raw count
# for term frequency. Once we have the raw counts, we can calculate the
# other measures.

(pd.DataFrame({'raw_count': words.value_counts()})
 .assign(frequency=lambda df: df.raw_count / df.raw_count.sum())
 .assign(augmented_frequency=lambda df: df.frequency / df.frequency.max()))

| | raw_count | frequency | augmented_frequency |
|---|---|---|---|
| lamb | 3 | 0.272727 | 1.0 |
| a | 3 | 0.272727 | 1.0 |
| little | 3 | 0.272727 | 1.0 |
| had | 1 | 0.090909 | 0.333333 |
| mary | 1 | 0.090909 | 0.333333 |


**Inverse Document Frequency** tells us how much information a word provides. It is based on how commonly a word appears across multiple documents. The metric is devised such that the more documents a word appears in, the lower the IDF for that word will be.

NOTE: IDF divides the number of documents by the number of documents that contain the word. If a given word doesn't appear in any documents, that denominator would be zero, so some definitions of IDF add 1 to the denominator.

**A higher IDF means that a word provides more information.** That is, it is more relevant within a single document.
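
A minimal sketch of the zero-denominator adjustment mentioned in the note, assuming whole-word matching and a 1 added to the denominator. The helper name `idf_smoothed` and the toy `docs` dictionary are hypothetical, made up for illustration; other definitions of IDF (for example, log-scaled ones) exist as well.

```python
def idf_smoothed(word, docs):
    """Ratio-style IDF with 1 added to the denominator (one common variant)."""
    # count the documents that contain the word as a whole token
    n_containing = sum(1 for text in docs.values() if word in text.split())
    # the +1 keeps the denominator non-zero for words that appear nowhere
    return len(docs) / (1 + n_containing)

# hypothetical toy corpus, already cleaned
docs = {
    'a': 'mary had a little lamb',
    'b': 'the lamb was little',
    'c': 'mary went home',
}

print(idf_smoothed('lamb', docs))    # appears in 2 of 3 documents
print(idf_smoothed('dragon', docs))  # appears in none, still defined
```

Rarer words still receive larger values; the adjustment only shifts every denominator by one.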

### Calculating IDF for multiple words

In [5]:
# PREPARE THE DATA
# our 3 example documents
documents = {
    'news': 'Codeup announced last thursday that they just launched a new data science program. It is 18 weeks long.',
    'description': 'Codeup\'s data science program teaches hands on skills using Python and pandas.',
    'context': 'Codeup\'s data science program was created in response to a percieved lack of data science talent, and growing demand.'
}
pprint(documents)

print('\nCleaning and lemmatizing...\n')

documents = {topic: lemmatize(basic_clean(documents[topic])) for topic in documents}
pprint(documents)


{'context': "Codeup's data science program was created in response to a "
            'percieved lack of data science talent, and growing demand.',
 'description': "Codeup's data science program teaches hands on skills using "
                'Python and pandas.',
 'news': 'Codeup announced last thursday that they just launched a new data '
         'science program. It is 18 weeks long.'}

Cleaning and lemmatizing...

{'context': "codeup's data science program wa created in response to a "
            'percieved lack of data science talent and growing demand',
 'description': "codeup's data science program teach hand on skill using "
                'python and panda',
 'news': 'codeup announced last thursday that they just launched a new data '
         'science program it is 18 week long'}


In [6]:
## Calculate IDF for each word

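def idf(word):
    # NOTE: `word in doc` is a substring test on the document string, so 'a'
    # also matches 'data'; `word in doc.split()` would match whole words only
    n_occurences = sum([1 for doc in documents.values() if word in doc])
    return len(documents) / n_occurences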

# Get a list of the unique words
unique_words = pd.Series(' '.join(documents.values()).split()).unique()

# put the unique words into a data frame
(pd.DataFrame(dict(word=unique_words))
 # calculate the idf for each word
 .assign(idf=lambda df: df.word.apply(idf))
 # sort the data for presentation purposes
 .set_index('word')
 .sort_values(by='idf', ascending=False)
 .head(5))

| word | idf |
|---|---|
| teach | 3.0 |
| created | 3.0 |
| hand | 3.0 |
| skill | 3.0 |
| using | 3.0 |
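
One thing to be aware of: the `idf` function in the cell above tests `word in doc` where `doc` is a string, which is a substring check rather than a whole-word check. A short sketch of the difference, using the cleaned `description` text from this lesson; `contains_word` is a hypothetical helper name.

```python
description = "codeup's data science program teach hand on skill using python and panda"

def contains_word(word, text):
    # compare against whole tokens rather than searching inside the string
    return word in text.split()

# substring test: the letter 'a' occurs inside 'data', 'teach', 'panda', ...
print('a' in description)               # True
# token test: 'a' is not one of the words in this document
print(contains_word('a', description))  # False
# 'codeup' is a substring of "codeup's" but not a token on its own here
print('codeup' in description, contains_word('codeup', description))  # True False
```

For short or common words the two tests can give different document counts, and therefore different IDF values.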


### Calculating TF-IDF  (TF times IDF)
TF-IDF is simply the product of the two metrics we've discussed above. Let's calculate a TF-IDF value for every word in every document:

In [7]:
tfs = []

# We'll calculate the tf-idf value for every word across every document

# Start by iterating over all the documents
for doc, text in documents.items():
    # We'll make a data frame that contains the tf for every word in every document
    df = (pd.Series(text.split())
          .value_counts()
          .reset_index()
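          # name the columns (the `inplace` argument was removed in pandas 2.0)
          .set_axis(['word', 'raw_count'], axis=1)
          # NOTE: df.shape[0] is the number of distinct words, not the total word count
          .assign(tf=lambda df: df.raw_count / df.shape[0])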
          .drop(columns='raw_count')
          .assign(doc=doc))
    # Then add that data frame to our list
    tfs.append(df)

# We'll then concatenate all the tf values together.
(pd.concat(tfs)
 # calculate the idf value for each word
 .assign(idf=lambda df: df.word.apply(idf))
 # then use the tf and idf values to calculate tf-idf
 .assign(tf_idf=lambda df: df.idf * df.tf)
 .drop(columns=['tf', 'idf'])
 .sort_values(by='tf_idf', ascending=False))

| | word | doc | tf_idf |
|---|---|---|---|
| 11 | using | description | 0.25 |
| 8 | python | description | 0.25 |
| 7 | hand | description | 0.25 |
| 4 | teach | description | 0.25 |
| 3 | panda | description | 0.25 |
| 1 | skill | description | 0.25 |
| 16 | created | context | 0.176471 |
| 13 | response | context | 0.176471 |
| 4 | of | context | 0.176471 |
| 3 | demand | context | 0.176471 |
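
As a quick check of the output above, the sketch below recomputes two entries by hand. It follows the notebook's code, where tf is the raw count divided by the number of distinct words in the document (`df.shape[0]`) and idf is the number of documents divided by the number of documents containing the word. The helper name `tf_as_in_notebook` is made up for illustration.

```python
context = ("codeup's data science program wa created in response to a "
           "percieved lack of data science talent and growing demand")
description = "codeup's data science program teach hand on skill using python and panda"

n_documents = 3

def tf_as_in_notebook(word, text):
    words = text.split()
    # raw count divided by the number of distinct words, mirroring df.shape[0]
    return words.count(word) / len(set(words))

# 'created' and 'using' each appear in exactly one of the three documents
idf_single_doc = n_documents / 1

created_tf_idf = tf_as_in_notebook('created', context) * idf_single_doc
using_tf_idf = tf_as_in_notebook('using', description) * idf_single_doc

print(round(created_tf_idf, 6))  # 0.176471
print(round(using_tf_idf, 6))    # 0.25
```

The `context` document has 17 distinct words and `description` has 12, which is why words unique to `description` score higher.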


In [8]:
### MORE COMMONLY SEEN IN HORIZONTAL FORMAT - with words as features
# We'll then concatenate all the tf values together.
(pd.concat(tfs)
 # calculate the idf value for each word
 .assign(idf=lambda df: df.word.apply(idf))
 # then use the tf and idf values to calculate tf-idf
 .assign(tf_idf=lambda df: df.idf * df.tf)
 .drop(columns=['tf', 'idf'])
 .sort_values(by='tf_idf', ascending=False)
 .pipe(lambda df: pd.crosstab(df.doc, df.word, values=df.tf_idf, aggfunc=lambda x: x))
 .fillna(0))

| doc | 18 | a | and | announced | codeup | codeup's | created | data | demand | growing | ... | skill | talent | teach | that | they | thursday | to | using | wa | week |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| context | 0.0 | 0.058824 | 0.088235 | 0.0 | 0.0 | 0.088235 | 0.176471 | 0.117647 | 0.176471 | 0.176471 | ... | 0.0 | 0.176471 | 0.0 | 0.0 | 0.0 | 0.0 | 0.176471 | 0.0 | 0.176471 | 0.0 |
| description | 0.0 | 0.0 | 0.125 | 0.0 | 0.0 | 0.125 | 0.0 | 0.083333 | 0.0 | 0.0 | ... | 0.25 | 0.0 | 0.25 | 0.0 | 0.0 | 0.0 | 0.0 | 0.25 | 0.0 | 0.0 |
| news | 0.166667 | 0.055556 | 0.0 | 0.166667 | 0.055556 | 0.0 | 0.0 | 0.055556 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.166667 | 0.166667 | 0.166667 | 0.0 | 0.0 | 0.0 | 0.166667 |


### <font color = 'red'> USING SCIKIT-LEARN TO FIND TF-IDF</font>

In [9]:
#imports
from sklearn.feature_extraction.text import TfidfVectorizer


#make the thing
tfidf = TfidfVectorizer()

#fit the thing
tfidfs = tfidf.fit_transform(documents.values())
tfidfs

<3x36 sparse matrix of type '<class 'numpy.float64'>'
	with 45 stored elements in Compressed Sparse Row format>

In [10]:
# Convert the sparse matrix to a dense DataFrame for display; this is only
# practical because our data set is small
pd.DataFrame(tfidfs.todense(), columns=tfidf.get_feature_names_out())

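| | 18 | and | announced | codeup | created | data | demand | growing | hand | in | ... | skill | talent | teach | that | they | thursday | to | using | wa | week |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.263566 | 0.0 | 0.263566 | 0.155666 | 0.0 | 0.155666 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.263566 | 0.263566 | 0.263566 | 0.0 | 0.0 | 0.0 | 0.263566 |
| 1 | 0.0 | 0.25388 | 0.0 | 0.19716 | 0.0 | 0.19716 | 0.0 | 0.0 | 0.333821 | 0.0 | ... | 0.333821 | 0.0 | 0.333821 | 0.0 | 0.0 | 0.0 | 0.0 | 0.333821 | 0.0 | 0.0 |
| 2 | 0.0 | 0.195932 | 0.0 | 0.152159 | 0.257627 | 0.304317 | 0.257627 | 0.257627 | 0.0 | 0.257627 | ... | 0.0 | 0.257627 | 0.0 | 0.0 | 0.0 | 0.0 | 0.257627 | 0.0 | 0.257627 | 0.0 |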
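
The scikit-learn values differ from the hand-computed ones because `TfidfVectorizer`, with its default settings, tokenizes differently (it keeps only tokens of two or more word characters, so `a` is dropped and `codeup's` becomes `codeup`), uses a smoothed, log-scaled idf of ln((1 + n) / (1 + df)) + 1, and scales each row to unit Euclidean length. The sketch below reproduces two values from the first row (the `news` document) using only the standard library; the variable names are illustrative.

```python
import math
import re
from collections import Counter

docs = [
    'codeup announced last thursday that they just launched a new data '
    'science program it is 18 week long',
    "codeup's data science program teach hand on skill using python and panda",
    "codeup's data science program wa created in response to a "
    'percieved lack of data science talent and growing demand',
]

# default token pattern: runs of two or more word characters
tokenized = [re.findall(r'\b\w\w+\b', doc) for doc in docs]

n = len(docs)
# document frequency: in how many documents each token occurs
doc_freq = Counter(token for tokens in tokenized for token in set(tokens))

def smooth_idf(token):
    return math.log((1 + n) / (1 + doc_freq[token])) + 1

# raw count times idf for the first document, then L2 normalization
counts = Counter(tokenized[0])
weights = {token: count * smooth_idf(token) for token, count in counts.items()}
norm = math.sqrt(sum(w ** 2 for w in weights.values()))
news_row = {token: w / norm for token, w in weights.items()}

print(round(news_row['codeup'], 6))     # 0.155666
print(round(news_row['announced'], 6))  # 0.263566
```

Because of the row normalization, scikit-learn's values are comparable across documents of different lengths, but they are not directly comparable to the unnormalized values computed earlier.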


### <font color = 'red'> USING TF-IDF in a model</font>

We'll use a **logistic regression model** and treat this as a standard **classification** problem.
- The data set is very imbalanced: there is much more ham than spam.

- Because each distinct word becomes a column, it is not uncommon to have more columns than rows.
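
To put the class imbalance in numbers, the sketch below uses the train-set class counts that appear in the classification report later in this lesson (3859 ham, 598 spam) and computes the accuracy of a baseline that always predicts the majority class.

```python
# class counts taken from the train classification report in this lesson
train_counts = {'ham': 3859, 'spam': 598}

total = sum(train_counts.values())
majority_label = max(train_counts, key=train_counts.get)
baseline_accuracy = train_counts[majority_label] / total

print(majority_label)              # ham
print(f'{baseline_accuracy:.2%}')  # 86.58%
```

Any model's accuracy should be read against this baseline, which is why precision and recall for the spam class are worth checking alongside accuracy.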
    

In [12]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
from env import user, password, host


#acquire the data
def get_db_url(database, host=host, user=user, password=password):
    return f'mysql+pymysql://{user}:{password}@{host}/{database}'

url = get_db_url("spam_db")
sql = "SELECT * FROM spam"

df = pd.read_sql(sql, url, index_col="id")
df.head()


| id | label | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [13]:
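#make the vectorizer
tfidf = TfidfVectorizer()

#fit the vectorizer and transform the text into tf-idf features
# NOTE: fitting on all of the text before splitting lets the idf weights see the test messages
X = tfidf.fit_transform(df.text)
y = df.label

#Split into train and test (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=.2)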

#create a train and test dataframe
train = pd.DataFrame(dict(actual=y_train))
test = pd.DataFrame(dict(actual=y_test))

#make and fit logistic regression model on train
lm = LogisticRegression().fit(X_train, y_train)

#predict train and test
train['predicted'] = lm.predict(X_train)
test['predicted'] = lm.predict(X_test)
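
The cell above fits the vectorizer on every message before splitting, so the idf weights are computed with the test text included. A minimal sketch of the alternative, splitting first and fitting the vectorizer on the training text only, is shown here on a small made-up corpus. The messages, labels, and `random_state` value are hypothetical and are not taken from the spam data set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# hypothetical toy messages, for illustration only
texts = [
    'win a free prize now', 'free entry claim your prize',
    'urgent call now to win cash', 'claim free cash today',
    'are we still on for lunch', 'see you at the office tomorrow',
    'can you call me after class', 'thanks for the notes today',
]
labels = ['spam'] * 4 + ['ham'] * 4

# split the raw text first
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, stratify=labels, test_size=0.25, random_state=0)

# learn the vocabulary and idf weights from the training text only
tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(text_train)
# reuse the fitted vocabulary and weights for the test text
X_test = tfidf.transform(text_test)

lm = LogisticRegression().fit(X_train, y_train)
test_predictions = lm.predict(X_test)
```

Words that appear only in the test messages are simply ignored by `transform`, which mirrors what happens when the model meets unseen text.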

In [16]:
# CLASSIFICATION REPORT ON TRAIN
print('Accuracy: {:.2%}'.format(accuracy_score(train.actual, train.predicted)))
print('---')
print('Confusion Matrix')
print(pd.crosstab(train.predicted, train.actual))
print('---')
print(classification_report(train.actual, train.predicted))

Accuracy: 97.67%
---
Confusion Matrix
actual      ham  spam
predicted            
ham        3857   102
spam          2   496
---
              precision    recall  f1-score   support

         ham       0.97      1.00      0.99      3859
        spam       1.00      0.83      0.91       598

    accuracy                           0.98      4457
   macro avg       0.99      0.91      0.95      4457
weighted avg       0.98      0.98      0.98      4457



In [17]:
# CLASSIFICATION REPORT ON TEST

print('Accuracy: {:.2%}'.format(accuracy_score(test.actual, test.predicted)))
print('---')
print('Confusion Matrix')
print(pd.crosstab(test.predicted, test.actual))
print('---')
print(classification_report(test.actual, test.predicted))

Accuracy: 96.68%
---
Confusion Matrix
actual     ham  spam
predicted           
ham        964    35
spam         2   114
---
              precision    recall  f1-score   support

         ham       0.96      1.00      0.98       966
        spam       0.98      0.77      0.86       149

    accuracy                           0.97      1115
   macro avg       0.97      0.88      0.92      1115
weighted avg       0.97      0.97      0.97      1115
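
The spam precision and recall in the report follow directly from the confusion matrix. A short sketch using the test-set counts printed above; the variable names are illustrative and read as "actual label as predicted label".

```python
# test-set confusion matrix counts from the output above
ham_as_ham, spam_as_ham = 964, 35    # row: predicted ham
ham_as_spam, spam_as_spam = 2, 114   # row: predicted spam

total = ham_as_ham + spam_as_ham + ham_as_spam + spam_as_spam

accuracy = (ham_as_ham + spam_as_spam) / total
# precision: of the messages predicted spam, how many really are spam
spam_precision = spam_as_spam / (spam_as_spam + ham_as_spam)
# recall: of the actual spam messages, how many were caught
spam_recall = spam_as_spam / (spam_as_spam + spam_as_ham)

print(f'{accuracy:.2%}')         # 96.68%
print(round(spam_precision, 2))  # 0.98
print(round(spam_recall, 2))     # 0.77
```

The model rarely flags ham as spam, but it misses roughly a quarter of the spam messages in the test set.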

