![](image/LambdaSchool.png)

# Supervised Machine Learning with Numerical and Text-Based Features

### Details
* <b> Event: </b> Lambda School Guest Lecture
* <b> Instructor: </b> Bruno Janota, Senior Data Scientist at Lockheed Martin
* <b> Date: </b> Monday, June 17th, 2019

### Sections
1. [Representing Text as Numbers](#1.-Representing-Text-as-Numbers)
2. [Topic Modeling](#2.-Topic-Modeling)
3. [Next Steps](#3.-Next-Steps)

# 1. Representing Text as Numbers

In order for a computer to execute any form of analytics or machine learning on natural language, data scientists must convert the raw text that we as humans can comprehend into a format that computers can understand.  That format is a numeric representation.

The first task in any Natural Language Processing analysis is to parse the raw text into objects called tokens. This process is called <b>tokenization</b>. The result of tokenization is a list of words that represents each text input (document, tweet, etc.) in a dataset. 
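Before reaching for a library, the idea of tokenization can be sketched with a regular expression. This is a minimal illustration, not how any particular library tokenizes: the helper name `simple_tokenize` and the example sentence are assumptions made up for this sketch, and real tokenizers handle contractions, abbreviations, and other edge cases far more carefully.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens (illustrative only)."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

example_tokens = simple_tokenize("Tokenization splits text into pieces.")
print(example_tokens)
```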

Let's tokenize an example sentence. Feel free to experiment with your own sentence. For this task, we will use `NLTK` (the Natural Language Toolkit), a Python package for NLP that was originally developed at the University of Pennsylvania.

In [None]:
# Import the nltk package
import nltk
nltk.download("punkt")

In [None]:
# Experiment with your own sentence and observe the tokens
sentence = "I will be an NLP guru by the end of lunch today."
tokens = nltk.word_tokenize(sentence)
print(f"'{sentence}' becomes\n{tokens}")

### 1.1 Term Frequency & Text Pre-Processing

Once the raw text has been tokenized, one of the simplest ways to represent the tokenized data in numeric form is to compute the number of occurrences of a word in the text, more commonly known as the <b>term frequency</b>.  This approach takes the tokens and counts the number of occurrences for each within the tokenized text.
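As a rough sketch of the counting step, term frequency for a single tokenized document can be computed with the standard library alone. The token list and the names `term_frequency`, `doc_tokens`, and `tf_vector` are made up for this illustration.

```python
from collections import Counter

def term_frequency(tokens):
    """Count how many times each token occurs in one tokenized document."""
    return Counter(tokens)

# Hypothetical tokenized document
doc_tokens = ["good", "movie", "good", "plot"]
tf_counts = term_frequency(doc_tokens)

# Fix a vocabulary order so the counts can be laid out as a numeric vector
vocabulary = sorted(tf_counts)
tf_vector = [tf_counts[term] for term in vocabulary]

print(vocabulary)
print(tf_vector)
```

Each position in `tf_vector` lines up with one term in `vocabulary`, which is the same layout as one row of a document-term table.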

Luckily, there are many pre-built options in Python for performing this task. We will use Scikit-Learn's `CountVectorizer` class to calculate the term frequencies. Scikit-Learn is one of the industry standard Python packages for analytics.

Let's calculate the term frequency over a list of sentences. We can provide the list of raw text to `CountVectorizer` and use its default tokenizer, or pass the `NLTK` tokenizer as an argument.

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "I like this movie, very funny",
    "I hate all movies.",
    "Movie was awesome! I loved it.",
    "Good movie. I love it."
]

tf_vectorizer = CountVectorizer(lowercase=False)
X = tf_vectorizer.fit_transform(sentences)

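# get_feature_names_out() replaces get_feature_names(), which newer scikit-learn releases removed
tf_df = pd.DataFrame(data = X.toarray(), columns = tf_vectorizer.get_feature_names_out())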
print(f'Number of sentences: {tf_df.shape[0]}\nNumber of tokens: {tf_df.shape[1]}\n')
tf_df

This table shows us how term frequency works. Each row is a sentence in our `sentences` list, each column is a unique token found across the dataset, and each cell holds the number of times that token occurs in that sentence.

<i>Does anything odd stand out to you about the tokens themselves?</i>

These nuances are common problems in Natural Language Processing tasks, but luckily there are several well-documented approaches and general NLP best practices for handling them, which we will discuss in more detail below:
- Stop Word Removal
- Lowercasing text
- Stemming (rule-based heuristic approach to convert words into their stem)
- Lemmatization (utilizes pre-determined vocabulary and morphological analysis of words to return the base form of a word)

Let's repeat the tokenization process above by using a different tokenizer than the default within `CountVectorizer`, converting tokens to lowercase, removing stop words, and observing the resulting term frequency table.

#### Stop Word Removal

In [None]:
nltk.download('stopwords')

Python's Natural Language Toolkit (NLTK) package contains a list of 179 commonly used English words that do not have much value in helping to understand or extract meaning from text. It is usually a good starting point but can be easily extended for text in specific domains (social media, emails, surveys, etc.). 

In [None]:
# NLTK Stop words
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
#stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

print('Number of Stop Words: {}'.format(len(stop_words)))
print('Example Stop Words: {}'.format(stop_words[0:20]))

In [None]:
# Experiment with your own sentence and observe the tokens
sentence = "I will be an NLP guru by the end of lunch today."
tokens = nltk.word_tokenize(sentence)
# The NLTK stop word list is lowercase, so compare against the lowercased token
tokens_no_stop = [word for word in tokens if word.lower() not in stop_words]
print(f"Tokenized sentence: {tokens}")
print(f"Tokenized sentence without stop words: {tokens_no_stop}")

#### Lowercasing Text

Let's tokenize the same sentences as above removing stop words and lowercasing words.

In [None]:
tf_vectorizer = CountVectorizer(lowercase=True, stop_words=stop_words)
X = tf_vectorizer.fit_transform(sentences)

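# get_feature_names_out() replaces get_feature_names(), which newer scikit-learn releases removed
tf_df = pd.DataFrame(data = X.toarray(), columns = tf_vectorizer.get_feature_names_out())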
print(f'Number of sentences: {tf_df.shape[0]}\nNumber of tokens: {tf_df.shape[1]}\n')
tf_df

This is great! We have reduced the sentences down to the most relevant words, but there is still some duplication, such as "movie" and "movies" or "love" and "loved".

#### Lemmatization and Stemming

In [None]:
nltk.download("wordnet")

In [None]:
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

# Experiment with your own sentence and observe the tokens
sentence = "The movies were awesome! I loved it." 

# Initialize Common Lemmatizers/Stemmers
wnl = WordNetLemmatizer()
ps = PorterStemmer()

# Compare WordNetLemmatizer and PorterStemmer
tokens = nltk.word_tokenize(sentence)
tokens_wnl = [wnl.lemmatize(t) for t in nltk.word_tokenize(sentence)]
tokens_ps = [ps.stem(t) for t in nltk.word_tokenize(sentence)]
print(f"Tokenized sentence: \n{tokens}")
print(f"Tokenized sentence with WordNetLemmatizer: \n{tokens_wnl}")
print(f"Tokenized sentence with PorterStemmer: \n{tokens_ps}")

In addition to the methods discussed above, Python's standard library also provides functions that can be used to remove punctuation.

In [None]:
# Import necessary packages
import string

# Remove the punctuation from our sentences
sentences_no_punc = [s.translate(str.maketrans('', '', string.punctuation)) for s in sentences]

# Examine sentences_no_punc
sentences_no_punc

#### Spell Check/Autocorrect

Lastly, depending on the dataset (for example, social media posts or other user-generated text), you may want to autocorrect the text to catch misspellings.

In [None]:
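# Recent releases of the autocorrect package expose a Speller class
# in place of the older module-level spell() function
from autocorrect import Speller

spell = Speller(lang='en')

word = 'mussage'
print(f'Original Spelling: {word}')
print(f'Autocorrect: {spell(word)}')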

There are many pre-built packages in Python that can help you transform and pre-process text data. This section introduced you to several common best-practice text cleaning options. When a pre-built option doesn't exist, it is fairly straightforward to build your own!

### 1.2 Term Frequency - Inverse Document Frequency (TF-IDF)

While term frequency is a simple way to represent text numerically, it looks at each document in isolation. <b>Term Frequency-Inverse Document Frequency</b>, better known as <b>TF-IDF</b>, is a common extension of term frequency that weights each term by how it is distributed across the entire dataset of text.

TF-IDF uses a calculation to determine a term's importance within the entire dataset. In plain language, if a term occurs frequently in one observation of the dataset (referred to as a document) and does not occur frequently in the other documents, it is given a higher numeric value, because it helps identify the document it appears in. Terms that are common across most or all of the documents in the dataset are in turn given lower numeric values, because they don't help distinguish one document from another.

If you are interested in the math behind TF-IDF, it can be found [here](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).
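For a feel of the arithmetic, here is a minimal pure-Python sketch of one TF-IDF variant. It assumes the smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each document vector, which is what scikit-learn's `TfidfVectorizer` does by default; other definitions of TF-IDF exist and give different numbers. The function name `tfidf_vectors` and the toy documents are made up for this illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs is a list of token lists. Returns (vocabulary, idf, vectors)."""
    n_docs = len(docs)
    vocabulary = sorted({token for doc in docs for token in doc})
    # Document frequency: number of documents that contain each term at least once
    doc_freq = Counter(token for doc in docs for token in set(doc))
    # Smoothed IDF (assumed variant): rarer terms receive larger weights
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocabulary]
        # L2-normalize so every document vector has unit length
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        vectors.append([v / norm for v in raw])
    return vocabulary, idf, vectors

# Hypothetical tokenized documents
toy_docs = [["good", "movie"], ["bad", "movie"], ["good", "plot"]]
toy_vocab, toy_idf, toy_vectors = tfidf_vectors(toy_docs)

for term in toy_vocab:
    print(f"{term}: idf = {toy_idf[term]:.3f}")
```

In the second toy document, "bad" appears in only one document while "movie" appears in two, so "bad" ends up with the larger weight even though both occur once.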

Let's use the same simple example as above to better visualize this numeric representation. We will define a `LemmaTokenizer` class that performs the tokenization and lemmatization, and apply it to the sentences with punctuation removed (`sentences_no_punc`).

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, articles):
        return [self.wnl.lemmatize(t) for t in nltk.word_tokenize(articles)]

# Initialize TF-IDF
tfidf_vectorizer = TfidfVectorizer(tokenizer=LemmaTokenizer(),
                                   lowercase=True)

# Fit to sentences_no_punc and transform into array
X = tfidf_vectorizer.fit_transform(sentences_no_punc)

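# get_feature_names_out() replaces get_feature_names(), which newer scikit-learn releases removed
tfidf_df = pd.DataFrame(data = X.toarray(), columns = tfidf_vectorizer.get_feature_names_out()).round(3)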
tfidf_df

TF-IDF is one of the most common ways of converting the documents within a dataset, also known as a corpus, into a numeric representation.  Now that we know how to preprocess raw text for modeling, let's learn about some different types of models and use some real-life data!

# 2. Topic Modeling

A topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "dog" and "bone" will appear more often in documents about dogs, "cat" and "meow" will appear in documents about cats, and "the" and "is" will appear equally in both. A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about cats and 90% about dogs, there would probably be about 9 times more dog words than cat words. The "topics" produced by topic modeling techniques are clusters of similar words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document's balance of topics is.
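The "10% cats, 90% dogs" intuition can be checked with a few lines of arithmetic. In this sketch the two topics, their word probabilities, and the document's topic mix are all invented for illustration; a real topic model would estimate these quantities from data.

```python
# Hypothetical word distributions for two topics (each sums to 1)
topic_word_probs = {
    "dogs": {"dog": 0.4, "bone": 0.3, "the": 0.2, "is": 0.1},
    "cats": {"cat": 0.4, "meow": 0.3, "the": 0.2, "is": 0.1},
}

# Hypothetical document that is 90% about dogs and 10% about cats
doc_topic_mix = {"dogs": 0.9, "cats": 0.1}

def expected_word_probs(topic_mix, topic_words):
    """Probability of each word in a document, mixing topics by their weights."""
    probs = {}
    for topic, weight in topic_mix.items():
        for word, p in topic_words[topic].items():
            probs[word] = probs.get(word, 0.0) + weight * p
    return probs

word_probs = expected_word_probs(doc_topic_mix, topic_word_probs)
dog_mass = word_probs["dog"] + word_probs["bone"]  # 0.9 * (0.4 + 0.3)
cat_mass = word_probs["cat"] + word_probs["meow"]  # 0.1 * (0.4 + 0.3)
dog_to_cat_ratio = dog_mass / cat_mass

print(f"Dog words per cat word: {dog_to_cat_ratio:.1f}")
print(f"Probability of 'the': {word_probs['the']:.2f}")
```

The shared words "the" and "is" keep the same probability no matter how the document is split between the two topics, which is why they carry no information about the topic mix.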

### Load MLB Pitcher Data
#### Career Statistics, Pitch Style Description and Tommy John Surgery Indicator

In this section, we will explore a dataset that contains career-level data and pitch style descriptions for 758 MLB starting pitchers who threw more than 100 innings between 2010 and 2018, as well as an indication of whether or not the pitcher underwent Ulnar Collateral Ligament Reconstruction, commonly referred to as Tommy John Surgery. 
- The career statistics were exported from Fangraphs.com
- Pitch style descriptions were scraped from BrooksBaseball.net player cards
- TJ Surgery indicator was merged from a Google Sheet maintained by @MLBPlayerAnalysis

In [None]:
import pandas as pd

# Read Career Statistics from FanGraphs and TJ Surgery Data with Pitch Style
fg_df = pd.read_csv('./data/FanGraphsCareerData.csv')
tj_df = pd.read_csv('./data/MLB_Pitchers_2008to2018_withBBPitchStyle.csv')
print(f'FanGraphs Dataframe Shape: {fg_df.shape}')
print(f'TJ Dataframe Shape: {tj_df.shape}')

In [None]:
# Merge Final dataframe with TJ label and FanGraphs Stats
df = pd.merge(tj_df, fg_df, on='playerid')
print(f'Final Dataframe Shape: {df.shape}\n')
df.head()

In [None]:
df.describe()

We can see right away that some features like IP (Innings Pitched) have skewed distributions, and that many of the features have different units (e.g., innings, time, etc.). Feature scaling may help improve our model performance. Three common scalers are defined below; the code in this section uses the sklearn `RobustScaler`, which centers each numerical feature on its median and divides by its interquartile range, so outliers have less influence on the scaling. Unlike the Min-Max scaler, it does not bound values to a fixed range such as 0 to 1.

Standard Scaler:
$
\begin{align}
\frac{x_i - mean(x)} {stdev(x)}
\end{align}
$

Min-Max Scaler:
$
\begin{align}
\frac{x_i - min(x)} {max(x) - min(x)}
\end{align}
$

Robust Scaler:
$
\begin{align}
\frac{x_i - median(x)} {Q3(x) - Q1(x)}
\end{align}
$
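To see how the three formulas behave on skewed data, the sketch below applies each one to a short list with a single large outlier, using only the standard library. The values in `innings` are made up, and the function names are hypothetical helpers for this illustration rather than scikit-learn APIs. The standard scaler here uses the population standard deviation, and the quartiles use linear interpolation between data points.

```python
import statistics

def standard_scale(values):
    """(x - mean) / standard deviation, using the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def min_max_scale(values):
    """(x - min) / (max - min), which maps the data onto the range 0 to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def robust_scale(values):
    """(x - median) / (Q3 - Q1), which is less sensitive to outliers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]

# Hypothetical skewed feature with one outlier
innings = [1, 2, 3, 4, 100]

print("Standard:", [round(v, 2) for v in standard_scale(innings)])
print("Min-Max: ", [round(v, 2) for v in min_max_scale(innings)])
print("Robust:  ", robust_scale(innings))
```

With Min-Max scaling the outlier squeezes the other four values close to 0, while the robust scaler keeps them evenly spread around 0 and leaves the outlier as a large value.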

In [None]:
import numpy as np
from sklearn import preprocessing

# Find numeric columns
num_cols = df.columns[df.dtypes.apply(lambda c: np.issubdtype(c, np.number))]

# Scale the numeric columns with RobustScaler (median-centered, IQR-scaled)
scaler = preprocessing.RobustScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])

In [None]:
df.describe()

In [None]:
# Review list of columns in our dataset
print('Columns in our Dataset: \n{}'.format(list(df.columns)))

In [None]:
# Let's look at the Pitch Style Description for Clayton Kershaw
player = 'Clayton Kershaw'
print('Player Name: \n{}\n'.format(player))
print('Pitch Style Description: \n{}'.format(list(df[df.Name_x == player]["BB_PitchStyle"])))

Let's look at how many pitchers in our dataset have gotten Tommy John surgery. 

In [None]:
df.tjSurgery.value_counts()

In [None]:
# TJ surgery as a percentage
df.tjSurgery.value_counts(normalize=True)

Check if the dataset has any missing data.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

df.isna().sum().plot.bar(figsize=(16,6), color='blue')
plt.title('Count of Missing Values per Column (n = 758)')
plt.ylabel('Counts')
plt.show()

The missing data corresponds to the velocity, percentage, and movement of pitches that a particular pitcher does not throw, so we will fill the missing values with 0.

In [None]:
df = df.fillna(0)

When performing analysis on text data, it is always a good idea to check for duplicate entries in fields that should be unique identifiers (playerid) and to drop rows that do not have any text.

In [None]:
# Check for duplicates in playerid
print('Any Duplicate PlayerIDs? {}'.format(any(df['playerid'].duplicated())))

### Let's transition to looking at pitch style description and some basic metrics

In [None]:
document_lengths = np.array(list(map(len, df.BB_PitchStyle.str.split(' '))))

print("The average number of words in a pitch style description is: {}.".format(np.mean(document_lengths)))
print("The minimum number of words in a pitch style description is: {}.".format(min(document_lengths)))
print("The maximum number of words in a pitch style description is: {}.".format(max(document_lengths)))

Let's take a look at the distribution of pitch style description lengths by whether or not a pitcher got Tommy John Surgery.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

fig, ax = plt.subplots(figsize=(10,6))

sns.distplot(document_lengths[df.tjSurgery == 'Yes'], ax=ax, label='Injured')
sns.distplot(document_lengths[df.tjSurgery == 'No'], ax=ax, label='Not Injured')

ax.set_title("Distribution of Number of Words", fontsize=16)
ax.set_xlabel("Number of Words")
plt.legend()
plt.show()

### Pre-Process Text and Build Vector Representation

Convert raw text to document-term matrix.

In [None]:
# Function to lemmatize the raw text of a pitch style description
def lemmatize(text):
    text = text.split()    
    lemmatizer = WordNetLemmatizer()
    lem_words = [lemmatizer.lemmatize(word) for word in text]
    text = " ".join(lem_words)
    return text

# Remove extra white space
df['BB_PitchStyle_Clean'] = df['BB_PitchStyle'].apply(lambda x: ' '.join(x.split()))

# Remove punctuation and numbers
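# regex=True is required because newer pandas releases treat the pattern as a literal string by default
df['BB_PitchStyle_Clean'] = df['BB_PitchStyle_Clean'].str.replace(r'[^\w\s]', ' ', regex=True).str.replace(r'\d+', '', regex=True)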

# Convert to lower case
df['BB_PitchStyle_Clean'] = df['BB_PitchStyle_Clean'].str.lower()

# Lemmatize
df['BB_PitchStyle_Clean'] = df['BB_PitchStyle_Clean'].map(lambda x: lemmatize(x))

Let's look at the pitch style description before and after cleaning.

In [None]:
player = 'Clayton Kershaw'
print('Original Text: \n{}\n'.format(list(df[df.Name_x == player]["BB_PitchStyle"])))
print('After Cleaning: \n{}'.format(list(df[df.Name_x == player]["BB_PitchStyle_Clean"])))

### Let's build a Document Term Matrix for use in building our LDA topic model

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

stop_words.extend(['mph','beta','feature'])
tf_vectorizer = CountVectorizer(ngram_range = (1,2),
                                stop_words = stop_words,
                                max_df = 0.8, 
                                min_df = 2)

dtm_tf = tf_vectorizer.fit_transform(df.BB_PitchStyle_Clean)
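# get_feature_names_out() replaces get_feature_names(), which newer scikit-learn releases removed
dtm_feature_names = tf_vectorizer.get_feature_names_out()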

print('Document Term Matrix Shape: {}'.format(dtm_tf.shape))

Look at the top 50 most frequent terms (single words and two-word phrases) in the pitch style descriptions.

In [None]:
plt.figure(figsize=(16,6))
term_df = pd.DataFrame(dtm_tf.toarray(), columns=dtm_feature_names)
term_df.sum(axis=0).sort_values(ascending=False)[0:50].plot.bar(color='blue')
plt.title("Top 50 Most Frequent Words in Pitch Style Descriptions")
plt.ylabel("Frequency")
plt.show()

### Build the Topic Model

We have everything required to train the LDA model. In addition to the document term matrix, you need to provide:
- the number of topics

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

n_topics = 10
lda_tf = LatentDirichletAllocation(n_components=n_topics, random_state=10)
lda_tf.fit(dtm_tf)

### View the topics with pyLDAvis

The above LDA model is built with 10 topics, where each topic is a combination of keywords and each keyword contributes a certain weight to the topic. The fitted scikit-learn model stores these weights in `lda_tf.components_`, and pyLDAvis lets you explore the keywords for each topic and the importance of each keyword interactively, as shown next.

![](images/Inferring-Topic-from-Keywords.png)

In [None]:
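import pyLDAvis
try:
    # Newer pyLDAvis releases ship the scikit-learn helpers as pyLDAvis.lda_model
    import pyLDAvis.lda_model as pyldavis_sklearn
except ImportError:
    # Older releases use pyLDAvis.sklearn
    import pyLDAvis.sklearn as pyldavis_sklearn
pyLDAvis.enable_notebook()

# Visualize the topics
pyldavis_sklearn.prepare(lda_tf, dtm_tf, tf_vectorizer)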
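As a plain-text complement to the interactive view, the highest-weighted terms of each topic can be listed directly from a topic-term weight matrix. The helper `top_terms_per_topic` is a hypothetical function written for this sketch, and the toy weights and term names are invented so the example runs on its own.

```python
def top_terms_per_topic(topic_term_weights, feature_names, n_terms=5):
    """Return the n_terms highest-weighted terms for each topic (one row per topic)."""
    top_terms = []
    for weights in topic_term_weights:
        # Rank term indices from largest to smallest weight
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        top_terms.append([feature_names[i] for i in ranked[:n_terms]])
    return top_terms

# Made-up weights for 2 topics over 4 terms
toy_weights = [[5.0, 0.1, 3.0, 0.2],
               [0.3, 4.0, 0.1, 6.0]]
toy_terms = ["fastball", "curve", "slider", "changeup"]

toy_top_terms = top_terms_per_topic(toy_weights, toy_terms, n_terms=2)
for idx, terms in enumerate(toy_top_terms):
    print(f"Topic {idx}: {', '.join(terms)}")
```

With the fitted model from this notebook, the same helper should accept `lda_tf.components_` as the weight matrix and the vectorizer's feature names as the term list.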

### Build a Simple Decision Tree

Remove columns that should not be included in the model's predictions.

In [None]:
df1 = df[df.columns.difference(['playerid','Name_x','BB_PitchStyle','Name_y','Team','IP','BB_PitchStyle_Clean','throws'])]
print('Shape of New DataFrame: {}\n'.format(df1.shape))
print('Features to Include in Model: \n{}'.format(list(df1.columns)))

#### Split Full Data Set into Train/Validation Set

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.decomposition import LatentDirichletAllocation

# Set seed for random number generator for reproducibility
seed=5

# Split dataframe into features and labels
features = df1.loc[:, df1.columns != 'tjSurgery']
labels = df1['tjSurgery']

# Split data using 80% to train model and 20% to validate performance
X_train1, X_test1, y_train1, y_test1 = train_test_split(features, labels, test_size = 0.2, random_state = seed)

# Confirm Shape of Train/Test data
print('Shape of Train Features: {}'.format(X_train1.shape))
print('Shape of Train Labels:   {}'.format(y_train1.shape))
print('Shape of Test Features:  {}'.format(X_test1.shape))
print('Shape of Test Labels:    {}'.format(y_test1.shape))

### Build a Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Train decision tree 
dtree1 = DecisionTreeClassifier(class_weight='balanced', random_state=seed)
dtree1.fit(X_train1, y_train1)

#### Get top_n Feature Importances from Decision Tree

The importance of a feature in our Decision Tree is computed as the (normalized) total reduction of the criterion brought by that feature and is also known as the Gini importance. 
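To make the "reduction of the criterion" concrete, the sketch below computes the Gini impurity of a node and the impurity decrease produced by one split. The label lists are invented for illustration, and the function names are hypothetical helpers, not scikit-learn APIs. In a fitted tree, a feature's importance accumulates these decreases over every node that splits on it, weighted by the share of samples reaching the node, and the totals are then normalized to sum to 1.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two child nodes."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini_impurity(left) \
                      + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted_children

# Hypothetical node with 4 injured and 4 non-injured pitchers, split into two children
parent_labels = ["Yes"] * 4 + ["No"] * 4
left_labels = ["Yes"] * 3 + ["No"] * 1
right_labels = ["Yes"] * 1 + ["No"] * 3

split_gain = impurity_decrease(parent_labels, left_labels, right_labels)
print(f"Parent impurity:   {gini_impurity(parent_labels)}")
print(f"Impurity decrease: {split_gain}")
```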

In [None]:
top_n = 10
feat_imp1 = pd.DataFrame({'Importance': dtree1.feature_importances_})    
feat_imp1['Feature'] = X_train1.columns
feat_imp1.sort_values(by='Importance', ascending=False, inplace=True)
feat_imp1 = feat_imp1.iloc[:top_n]

# Plot Feature Importance Values
plt.figure(figsize=(20,6))
plt.bar(feat_imp1['Feature'], feat_imp1['Importance'])
plt.xticks(rotation=45)
plt.title('Model #1: Feature Importance')
plt.show()

In [None]:
import scikitplot as skplt

# Make predictions on validation data
test_predictions1 = dtree1.predict(X_test1)

# Plot Confusion Matrix on Validation Data
skplt.metrics.plot_confusion_matrix(y_test1, test_predictions1, 
                                    figsize=(8,8),
                                    x_tick_rotation=90,
                                    title='Confusion Matrix (Validation Data)',
                                    normalize=False)
plt.show()

In [None]:
from sklearn.metrics import classification_report, accuracy_score

m1_acc = accuracy_score(y_test1, test_predictions1)*100

print('Model #1: Validation Accuracy: {0:.2f}%\n'.format(m1_acc))
print(classification_report(y_test1, test_predictions1, target_names=['Not Injured', 'Injured']))

### Let's build the same Decision Tree Model as above and add the topic distributions for each pitcher

In [None]:
# Extract Topic Distributions from LDA model for each pitcher
col_names = ["Topic {0}".format(x) for x in range(0, n_topics)]
topic_dist = lda_tf.transform(dtm_tf)
# Reuse df1's index so the rows line up when the two dataframes are concatenated
topic_df = pd.DataFrame(topic_dist, columns = col_names, index = df1.index)

# Join topic dataframe with numerical features from Model #1
df2 = pd.concat([df1, topic_df], axis=1)
print(f'Original Dataframe Shape: {df1.shape}')
print(f'New Dataframe Shape: {df2.shape}\n')
df2.head()

Perform the train/test split for the new data.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.decomposition import LatentDirichletAllocation

# Split dataframe into features and labels
features = df2.loc[:, df2.columns != 'tjSurgery']
labels = df2['tjSurgery']

# Split data using 80% to train model and 20% to validate performance
X_train2, X_test2, y_train2, y_test2 = train_test_split(features, labels, test_size = 0.2, random_state = seed)

# Confirm Shape of Train/Test data
print('Shape of Train Features: {}'.format(X_train2.shape))
print('Shape of Train Labels:   {}'.format(y_train2.shape))
print('Shape of Test Features:  {}'.format(X_test2.shape))
print('Shape of Test Labels:    {}'.format(y_test2.shape))

In [None]:
from sklearn.tree import DecisionTreeClassifier

dtree2 = DecisionTreeClassifier(class_weight='balanced', random_state=seed)
dtree2.fit(X_train2, y_train2)

#### Compare Feature Importance for Model #1 (Numerical-only) and Model #2 (Numerical + LDA)

In [None]:
# Get top_n feature importances from decision tree
top_n = 10
feat_imp2 = pd.DataFrame({'Importance': dtree2.feature_importances_})    
feat_imp2['Feature'] = X_train2.columns
feat_imp2.sort_values(by='Importance', ascending=False, inplace=True)
feat_imp2 = feat_imp2.iloc[:top_n]

# Plot Feature Importance Values
plt.figure(figsize=(16,12))

# Plot the feature importance for Model #1
plt.subplot(2, 1, 1)
plt.bar(feat_imp1['Feature'], feat_imp1['Importance'])
plt.xticks(rotation=45)
plt.title('Model #1: Feature Importance')

# Plot the feature importance for Model #2
plt.subplot(2, 1, 2)
plt.bar(feat_imp2['Feature'], feat_imp2['Importance'])
plt.xticks(rotation=45)
plt.title('Model #2: Feature Importance')
plt.show()

In [None]:
# Make predictions with model 2 on validation data
test_predictions2 = dtree2.predict(X_test2)

plt.figure(figsize=(16,8))

# Re-Plot Confusion Matrix on Validation Data for Model #1 (for comparison)
plot1 = skplt.metrics.plot_confusion_matrix(y_test1, test_predictions1, 
                                            figsize=(8,8),
                                            title='Model #1: Numerical Features',
                                            normalize=False,
                                            ax = plt.subplot(1, 2, 1))

# Plot Confusion Matrix on Validation Data for Model #2
plot2 = skplt.metrics.plot_confusion_matrix(y_test2, test_predictions2, 
                                            figsize=(8,8),
                                            title='Model #2: Numerical + LDA Topics',
                                            normalize=False,
                                            ax = plt.subplot(1, 2, 2))

In [None]:
m2_acc = accuracy_score(y_test2, test_predictions2)*100

target_names = ['Not Injured','Injured']
print('Model #2: Overall Accuracy: {0:.2f}%'.format(m2_acc))
print('Improvement over Model #1:   {0:.2f}%\n'.format(m2_acc - m1_acc))
print('Model #1 (Numerical Features Only):')
print(classification_report(y_test1, test_predictions1, target_names=target_names))
print('Model #2 (Numerical Features + LDA Topics):')
print(classification_report(y_test2, test_predictions2, target_names=target_names))

# 3. Next Steps

- Implement a technique to identify the optimal number of topics for the LDA model and assess the impact
- Try different sampling techniques (up, down, SMOTE, etc.) and understand performance impacts
- Download Lahman database master table and identify which players are still active. Use the Master table to find the pitchers that ended their career with no injury.