# Introduction to ML


## Summary

* Run install_packages.ipynb

**Ex 1** End-to-end decision tree

* Step 1: Data
* Step 2: Processing
* Step 3: Modeling (no vectorization step)
* Step 4: Visualization
* Change parameters
   
**Ex 2** Cleaning and feature extraction applied to natural language

* Text cleaning with regex
* Feature extraction
    * Tokenizer
    * Removing stop words
    * Vectorization
    * Stemming
    * Lemmatization
    
**Ex 3** Sentiment analysis with pre-trained model

## Ex 1: End-to-end project: Decision Tree

### Goals

* Predict the survival of Titanic passengers
* Practice decision trees
* End-to-end ML project

### About the data

We will use the well-known Titanic dataset.

The dataset has the following columns:

* `Survived` - boolean
    * 0 - No
    * 1 - Yes
    
* `Pclass` (passenger class) - enumerated
    * 1
    * 2
    * 3
    
* `Name` - string

* `Sex` - enumerated
    * male
    * female

* `Age` - integer

* `Siblings/Spouses Aboard` (number of siblings/spouses) - integer

* `Parents/Children Aboard` (number of parents/children) - integer

* `Fare` (in pounds) - float

### First things first: import packages

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn import metrics

### Step 1: Data

In [None]:
!wget https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv

# Load data into a pandas dataframe
df = pd.read_csv("titanic.csv")

# See the first 10 rows
df.head(10)

### Step 2: Processing

Rename the columns to simpler names:

In [None]:
df = df.rename(columns={
    "Survived": "survived",
    "Pclass": "pclass",
    "Name": "name",
    "Sex": "sex",
    "Age": "age",
    "Siblings/Spouses Aboard": "ss_ab",
    "Parents/Children Aboard": "pc_ab",
    "Fare": "fare"
})

Select columns (drop the ones that don't look relevant):

In [None]:
# .copy() avoids pandas' SettingWithCopyWarning when the sex column is modified later
df = df[["survived", "pclass", "sex", "age", "ss_ab", "pc_ab"]].copy()

Convert `sex` string to numbers:

* 1 for `female`
* 2 for `male`

In [None]:
df["sex"].unique()

In [None]:
# Do not run this cell more than once: on a second run no value equals "female",
# so every row would be set to 2
df["sex"] = df["sex"].apply(lambda x: 1 if (x == "female") else 2)

df

### Step 3: Modeling

Separate features and target:

In [None]:
# Features
features = ["pclass", "sex", "age", "ss_ab", "pc_ab"]
X = df[features]

# Target (label)
y = df["survived"]

Create training dataset and test dataset (80/20 split):

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

print(f"Training: X: {X_train.shape} y: {y_train.shape}")
print(f"Test: X: {X_test.shape} y: {y_test.shape}")

Build decision tree model (fit):

In [None]:
max_depth = 3
clf = DecisionTreeClassifier(max_depth=max_depth)

clf = clf.fit(X_train, y_train)
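
As a rough illustration of what `fit` does at each node: scikit-learn's `DecisionTreeClassifier` uses the Gini impurity criterion by default and prefers the split that lowers the size-weighted impurity of the child nodes the most. The sketch below computes this by hand on a made-up toy sample. The helper names `gini` and `weighted_gini` and the data are assumptions for illustration, not part of scikit-learn or of the Titanic dataset.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Made-up toy sample: (sex, survived), with 1 = female and 2 = male
toy = [(1, 1), (1, 1), (1, 1), (1, 0), (2, 0), (2, 0), (2, 0), (2, 1)]

parent = [survived for _, survived in toy]
left = [survived for sex, survived in toy if sex <= 1.5]   # female
right = [survived for sex, survived in toy if sex > 1.5]   # male

print(f"Impurity before the split: {gini(parent):.3f}")
print(f"Impurity after splitting on sex <= 1.5: {weighted_gini(left, right):.3f}")
```

A lower value after the split means the two child nodes are purer than the parent, which is what makes a split attractive to the tree.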

Predict:

In [None]:
y_pred = clf.predict(X_test)

print(f"Accuracy: {metrics.accuracy_score(y_test, y_pred)}")

print(f"Report:\n{metrics.classification_report(y_test, y_pred)}")
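
To make the numbers in the accuracy score and the classification report less opaque, here is a minimal sketch that derives accuracy, precision, recall and F1 for the positive class from raw counts. The labels `y_true` and `y_hat` are made up for illustration and are not the output of the model above.

```python
# Made-up labels for illustration (1 = survived, 0 = died)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_hat))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(pairs)     # share of correct predictions
precision = tp / (tp + fp)            # predicted survivors who really survived
recall = tp / (tp + fn)               # real survivors the model found
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f} Recall: {recall:.2f} F1: {f1:.2f}")
```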

### Step 4: Visualization

In [None]:
# Labels are:
#     0 -> died
#     1 -> survived

fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(16,9), dpi=100)

plot_tree(clf,
          feature_names = features,
          class_names=["died", "survived"],
          impurity=False,
          filled=True)

fig.savefig("titanic_decision_tree.png")

Profiles of people who survived:
    
* Woman, rich (class 1 or 2), age between 3 and 39

Profiles of people who died:

* Man, older than 13, poor (class 2 or 3)

In [None]:
# Woman, rich (class 1 or 2), age between 3 and 39
# ["pclass", "sex", "age", "ss_ab", "pc_ab"] = [1, 1, 38, 0, 0]
prediction = clf.predict([[1, 1, 38, 0, 0]])
print(prediction)


# Man, older than 13, poor (class 2 or 3)
# ["pclass", "sex", "age", "ss_ab", "pc_ab"] = [3, 2, 30, 0, 0]
prediction = clf.predict([[3, 2, 30, 0, 0]])
print(prediction)

### Change some parameters

In [None]:
features = ["pclass", "sex", "age", "ss_ab", "pc_ab"]
X = df[features]
y = df["survived"]

# IMPORTANT -> change test_size
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# IMPORTANT -> change max_depth (integer or None)
max_depth = 2
clf = DecisionTreeClassifier(max_depth=max_depth)

clf = clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(f"Accuracy: {metrics.accuracy_score(y_test, y_pred)}")

fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(16,9), dpi=100)

plot_tree(clf,
          feature_names = features,
          class_names=["died", "survived"],
          impurity=False,
          filled=True)

fig.savefig("titanic_decision_tree.png")

## Ex 2: Cleaning and feature extraction applied to natural language


### Goals

* Text cleaning with regex
* Feature extraction
    * Tokenizer
    * Removing stop words
    * Vectorization
    * Stemming
    * Lemmatization

### First things first

In [None]:
import re

import nltk

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

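nltk.download('punkt')
# Assumption: recent NLTK releases (3.9+) load the tokenizer data from 'punkt_tab' instead
nltk.download('punkt_tab')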
nltk.download('stopwords')
nltk.download('wordnet')

### Text cleaning

Only alphabetic chars:

In [None]:
''.join((x for x in "MadMax25" if x.isalpha()))

In [None]:
dummy = "The lazy dog jumped_'-! over the sleeping 123456FOX!"

''.join((x for x in dummy if x.isalpha()))

Now with regular expressions:

In [None]:
# Raw string, so the pattern contains no invalid escape sequences
re.sub(r'[^a-zA-Z ]', '', dummy)

Remove what is between parentheses:

In [None]:
dummy = "The lazy dog (whose name is Pluto) jumped over the sleeping fox (whose name is Foxie) yesterday"

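# Non-greedy match, so each parenthesized group is removed separately
clean_text = re.sub(r'\(.*?\)', '', dummy)
# Collapse the leftover runs of whitespace into a single space
clean_text = re.sub(r'\s+', ' ', clean_text)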

clean_text

To try regex: https://pythex.org/
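
The cleaning steps above can be chained into a single function. This is a minimal sketch: the helper name `clean` and the order of the steps are assumptions made for illustration, and other pipelines may reasonably differ.

```python
import re


def clean(text):
    """Hypothetical helper chaining the cleaning steps shown above."""
    text = re.sub(r"\(.*?\)", "", text)      # drop parenthesized asides
    text = re.sub(r"[^a-zA-Z ]", "", text)   # keep only letters and spaces
    text = re.sub(r"\s+", " ", text)         # collapse runs of whitespace
    return text.strip().lower()


print(clean("The lazy dog (whose name is Pluto) jumped_'-! over the sleeping 123456FOX!"))
```

Removing the parentheses first matters here: once the punctuation is stripped, the parentheses are gone and their content can no longer be identified.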

### Tokenization

In [None]:
dummy = "The lazy dog jumped over the sleeping fox"

word_tokenize(dummy)

### Removing stop words

Stop words usually refer to the most common words in a language.

In NLP, stop words are normally filtered out.


In [None]:
# In English (show 25 words only)
print(stopwords.words('english')[:25])

# In Spanish (show 25 words only)
print(stopwords.words('spanish')[:25])

In [None]:
dummy = "Do you know what? The lazy dog (whose name is Pluto) jumped over the sleeping fox (whose name is Foxie) yesterday."

# Only alphabetic chars
clean_text = re.sub(r'[^a-zA-Z ]', '', dummy)
clean_text = re.sub(r'\s+', ' ', clean_text)

# No capital letters
clean_text = clean_text.lower()

# Remove stopwords (build the set once instead of reloading the list per token)
english_stop_words = set(stopwords.words('english'))
words = word_tokenize(clean_text)
clean_text = [x for x in words if x not in english_stop_words]
clean_text = ' '.join(clean_text)
clean_text

### Vectorization

In [None]:
corpus = [
    "a car has four wheels two mirrors one middle mirror and four seats",
    "a scooter has two wheels two mirrors no middle mirror and two seat",
    "cars have diesel or gasoline engine",
    "scooters have gasoline engine"
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out())

print(X.toarray())
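
To see what the count matrix represents, here is a minimal pure-Python sketch of a bag-of-words count on two of the sentences. It imitates two defaults of `CountVectorizer` (lowercasing, and keeping only tokens of two or more characters), but it is an illustration under those assumptions, not the library's implementation. The helper name `tokenize` is hypothetical.

```python
import re
from collections import Counter

docs = [
    "cars have diesel or gasoline engine",
    "scooters have gasoline engine",
]


def tokenize(text):
    """Lowercase the text and keep words of two or more characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


# One column per distinct word, in alphabetical order
vocabulary = sorted({token for doc in docs for token in tokenize(doc)})

# One row per document: how often each vocabulary word occurs in it
matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```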

Let's improve this by removing the stop words:

In [None]:
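# Build the stop word set once instead of reloading the list for every token
english_stop_words = set(stopwords.words('english'))

def remove_stop_words(text):
    tokens = word_tokenize(text)

    tokens = [x for x in tokens if x not in english_stop_words]

    return ' '.join(tokens)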

corpus = [remove_stop_words(x) for x in corpus]
corpus

#### Count vectorizer

In [None]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out())

print(X.toarray())

#### TfidfVectorizer

In [None]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out())

print(X.toarray())
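
The TF-IDF weights can also be reproduced by hand. The sketch below assumes the default settings of `TfidfVectorizer` (raw term counts, smoothed IDF computed as `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row) and uses a made-up two-sentence corpus. Variable names such as `doc_freq` are chosen for illustration.

```python
import math
from collections import Counter

# Made-up mini corpus, already cleaned and without stop words
docs = [
    "car four wheels",
    "scooter two wheels",
]
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({term for tokens in tokenized for term in tokens})
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
doc_freq = {term: sum(term in tokens for tokens in tokenized) for term in vocabulary}

# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocabulary}

rows = []
for tokens in tokenized:
    counts = Counter(tokens)
    weights = [counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    rows.append([w / norm for w in weights])  # L2-normalize the row

print(vocabulary)
for row in rows:
    print([round(w, 3) for w in row])
```

Note how `wheels`, which appears in both documents, ends up with a lower weight than the words that are specific to one document.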

### Stemming

In [None]:
dummy = "a car has four wheels cars windows one rear window and one gasoline engine"

# Tokenize and remove stop words
tokens = word_tokenize(dummy)
tokens = [x for x in tokens if x not in stopwords.words('english')]


# Get stems
stemmer = PorterStemmer()

key_words = [stemmer.stem(x) for x in tokens]
set(key_words)

### Lemmatization

In [None]:
dummy = "a car has four wheels cars windows one rear window and one gasoline engine"

# Tokenize and remove stop words
tokens = word_tokenize(dummy)
tokens = [x for x in tokens if x not in stopwords.words('english')]


# Get lemmas
lemmatizer = WordNetLemmatizer() 

key_words = [lemmatizer.lemmatize(x) for x in tokens]
set(key_words)

## Ex 3: Sentiment analysis with pre-trained model

### Goals

* Learn about sentiment analysis
* Use a pre-trained model

### First things first

In [None]:
%pip install vaderSentiment

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

### Example

In [None]:
sentences = ["MadMax is an awesome movie",
             "Titanic is a terrible movie",
             "Titanic made me cry",
             "Titanic didn't make me cry",
             "Thank God it's Thursday.",
             "My wife is pregnant",
             "Messi is 🙀 😍",
             ]

analyzer = SentimentIntensityAnalyzer()
for sentence in sentences:
    vs = analyzer.polarity_scores(sentence)
    print("{:-<30} {}".format(sentence, str(vs)))
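
Each result is a dictionary with the keys `neg`, `neu`, `pos` and `compound`, where `compound` is a normalized overall score between -1 and 1. The VADER documentation suggests treating a compound score of 0.05 or more as positive and -0.05 or less as negative. The sketch below applies that rule. The helper name `label_sentiment` and the example score dictionaries are made up for illustration and are not actual analyzer output.

```python
def label_sentiment(scores, threshold=0.05):
    """Hypothetical helper: turn a VADER-style score dict into a label."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Made-up score dicts with the same keys that polarity_scores returns
examples = [
    {"neg": 0.0, "neu": 0.4, "pos": 0.6, "compound": 0.62},
    {"neg": 0.5, "neu": 0.5, "pos": 0.0, "compound": -0.48},
    {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
]

for scores in examples:
    print(f"{label_sentiment(scores):<10} {scores}")
```

In the loop above, the same helper could be called as `label_sentiment(vs)` to print a label next to each sentence.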