# Machine Learning

## Introduction

Machine learning (ML) is a type of artificial intelligence (AI) that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so. Machine learning algorithms use historical data as input to predict new output values.

### Types of Machine Learning
- Supervised Learning
- Unsupervised Learning
- Semi-Supervised Learning
- Reinforcement Learning

## Supervised Learning Algorithms

### Types of Supervised Learning Algorithms
- Regression
- Classification


#### Regression
- Linear Regression
- Polynomial Regression
- Support Vector Machines
- Decision Tree
- Random Forest

#### Classification
- Logistic Regression
- KNN Classification
- Support Vector Machines
- Decision Tree
- Random Forest

## Unsupervised Learning Algorithms

### Types of Unsupervised Learning Algorithms
- Clustering
- Association Rule Mining

## Errors in Learning Algorithms

### Error Calculation

- Regression
    - RMSE
    - R2 Score
    
- Classification
    - Accuracy
    - Precision
    - Recall

#### RMSE
![](img/rmse.png)
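
As a minimal sketch, assuming the image above shows the usual definition, RMSE is the square root of the mean squared difference between the true and predicted values. The helper name `rmse` and the sample numbers are illustrative assumptions, not part of the notebook.

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Made-up values for illustration only.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]
print(rmse(y_true, y_pred))  # squared errors are 1, 0, 4, so this is sqrt(5/3)
```

RMSE is expressed in the same units as the target, so lower is better and 0 means a perfect fit.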

#### R2
![](img/r2_score.png)
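
A minimal sketch of the R2 score, assuming the image above shows the standard definition `1 - SS_res / SS_tot`. The function name `r2_score_manual` and the sample values are hypothetical and chosen only for illustration.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error left after using the model.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    # Note: this is zero (and R2 undefined) if all true values are equal.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up values for illustration only.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]
print(r2_score_manual(y_true, y_pred))  # SS_res = 5, SS_tot = 8
```

A score of 1 means the predictions match exactly, 0 means the model does no better than predicting the mean, and negative values mean it does worse.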

#### Precision & Recall
![](img/accuracy_classification.png)
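
The three classification metrics can be sketched from the counts of true/false positives and negatives. This is an illustrative helper under assumed names (`binary_metrics`, `positive`) with made-up labels, not code from the notebook.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    # Precision: of everything predicted positive, how much was right.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of everything actually positive, how much was found.
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Made-up labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(binary_metrics(y_true, y_pred))
```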

## Linear Regression

In [None]:
!pip install numpy pandas scikit-learn matplotlib

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv("./data/regression/USA_Housing.csv")
df

| | Avg. Area Income | Avg. Area House Age | Avg. Area Number of Rooms | Avg. Area Number of Bedrooms | Area Population | Price | Address |
|---|---|---|---|---|---|---|---|
| 0 | 79545.458574 | 5.682861 | 7.009188 | 4.09 | 23086.800503 | 1.059034e+06 | 208 Michael Ferry Apt. 674\nLaurabury, NE 3701... |
| 1 | 79248.642455 | 6.002900 | 6.730821 | 3.09 | 40173.072174 | 1.505891e+06 | 188 Johnson Views Suite 079\nLake Kathleen, CA... |
| 2 | 61287.067179 | 5.865890 | 8.512727 | 5.13 | 36882.159400 | 1.058988e+06 | 9127 Elizabeth Stravenue\nDanieltown, WI 06482... |
| 3 | 63345.240046 | 7.188236 | 5.586729 | 3.26 | 34310.242831 | 1.260617e+06 | USS Barnett\nFPO AP 44820 |
| 4 | 59982.197226 | 5.040555 | 7.839388 | 4.23 | 26354.109472 | 6.309435e+05 | USNS Raymond\nFPO AE 09386 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 4995 | 60567.944140 | 7.830362 | 6.137356 | 3.46 | 22837.361035 | 1.060194e+06 | USNS Williams\nFPO AP 30153-7653 |
| 4996 | 78491.275435 | 6.999135 | 6.576763 | 4.02 | 25616.115489 | 1.482618e+06 | PSC 9258, Box 8489\nAPO AA 42991-3352 |
| 4997 | 63390.686886 | 7.250591 | 4.805081 | 2.13 | 33266.145490 | 1.030730e+06 | 4215 Tracy Garden Suite 076\nJoshualand, VA 01... |
| 4998 | 68001.331235 | 5.534388 | 7.130144 | 5.44 | 42625.620156 | 1.198657e+06 | USS Wallace\nFPO AE 73316 |


In [3]:
df.describe()

| | Avg. Area Income | Avg. Area House Age | Avg. Area Number of Rooms | Avg. Area Number of Bedrooms | Area Population | Price |
|---|---|---|---|---|---|---|
| count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 |
| mean | 68583.108984 | 5.977222 | 6.987792 | 3.98133 | 36163.516039 | 1232073.0 |
| std | 10657.991214 | 0.991456 | 1.005833 | 1.234137 | 9925.650114 | 353117.6 |
| min | 17796.63119 | 2.644304 | 3.236194 | 2.0 | 172.610686 | 15938.66 |
| 25% | 61480.562388 | 5.322283 | 6.29925 | 3.14 | 29403.928702 | 997577.1 |
| 50% | 68804.286404 | 5.970429 | 7.002902 | 4.05 | 36199.406689 | 1232669.0 |
| 75% | 75783.338666 | 6.650808 | 7.665871 | 4.49 | 42861.290769 | 1471210.0 |
| max | 107701.748378 | 9.519088 | 10.759588 | 6.5 | 69621.713378 | 2469066.0 |


In [4]:
from sklearn.preprocessing import StandardScaler

In [5]:
sc = StandardScaler()

In [9]:
X = df[["Avg. Area Income", "Avg. Area House Age", "Avg. Area Number of Rooms", "Avg. Area Number of Bedrooms", "Area Population"]]

In [10]:
y = df["Price"]

In [12]:
X

| | Avg. Area Income | Avg. Area House Age | Avg. Area Number of Rooms | Avg. Area Number of Bedrooms | Area Population |
|---|---|---|---|---|---|
| 0 | 79545.458574 | 5.682861 | 7.009188 | 4.09 | 23086.800503 |
| 1 | 79248.642455 | 6.002900 | 6.730821 | 3.09 | 40173.072174 |
| 2 | 61287.067179 | 5.865890 | 8.512727 | 5.13 | 36882.159400 |
| 3 | 63345.240046 | 7.188236 | 5.586729 | 3.26 | 34310.242831 |
| 4 | 59982.197226 | 5.040555 | 7.839388 | 4.23 | 26354.109472 |
| ... | ... | ... | ... | ... | ... |
| 4995 | 60567.944140 | 7.830362 | 6.137356 | 3.46 | 22837.361035 |
| 4996 | 78491.275435 | 6.999135 | 6.576763 | 4.02 | 25616.115489 |
| 4997 | 63390.686886 | 7.250591 | 4.805081 | 2.13 | 33266.145490 |
| 4998 | 68001.331235 | 5.534388 | 7.130144 | 5.44 | 42625.620156 |


In [13]:
y

0       1.059034e+06
1       1.505891e+06
2       1.058988e+06
3       1.260617e+06
4       6.309435e+05
            ...     
4995    1.060194e+06
4996    1.482618e+06
4997    1.030730e+06
4998    1.198657e+06
4999    1.298950e+06
Name: Price, Length: 5000, dtype: float64

In [15]:
from sklearn.model_selection import train_test_split

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [17]:
len(X_train), len(X_test)

(4000, 1000)

In [18]:
from sklearn.linear_model import LinearRegression

In [19]:
X_train = sc.fit_transform(X_train)

In [20]:
X_test = sc.transform(X_test)
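
The two cells above fit the scaler on the training split only and then reuse those statistics for the test split. A minimal single-column sketch of that idea, assuming the standard z-score formula with population standard deviation (which is what `StandardScaler` uses). The helper names `fit_scaler` and `transform` and the numbers are hypothetical.

```python
from statistics import fmean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation from training values."""
    return fmean(column), pstdev(column)

def transform(column, mean, std):
    """Apply z = (x - mean) / std using previously learned statistics."""
    return [(x - mean) / std for x in column]

# Made-up single feature column for illustration only.
train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

mean, std = fit_scaler(train)              # statistics come from train only
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)   # test reuses the train statistics

print(train_scaled)
print(test_scaled)
```

Reusing the training statistics keeps information about the test set out of the preprocessing step, which is why the notebook calls `fit_transform` on `X_train` but only `transform` on `X_test`.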

In [21]:
X_train

array([[ 0.1549993 , -0.28044632, -0.42412554, -1.60733701, -0.28843825],
       [ 0.33883917,  0.92455013, -0.72902431, -1.60733701,  0.95730588],
       [ 0.71539104, -0.79936939, -0.0182288 , -1.42881885,  0.43263779],
       ...,
       [-0.60990224, -0.92567376, -0.92201448,  0.02366989,  0.14341368],
       [-0.36860526,  0.604432  , -0.23712894, -0.53622801,  0.11824928],
       [-0.03627546,  1.18224223,  1.09496033, -0.70663171, -1.81776875]])

In [22]:
sc.inverse_transform(X_test)

array([[7.37141648e+04, 6.10737087e+00, 6.33705381e+00, 3.08000000e+00,
        3.00491700e+04],
       [6.35639142e+04, 5.66257099e+00, 6.84870299e+00, 2.26000000e+00,
        3.80886360e+04],
       [7.30350258e+04, 7.55376265e+00, 8.11482616e+00, 4.23000000e+00,
        4.34084205e+04],
       ...,
       [7.01305606e+04, 8.19531683e+00, 9.57004816e+00, 4.07000000e+00,
        2.67942550e+04],
       [6.93139697e+04, 4.64525948e+00, 6.73400688e+00, 3.29000000e+00,
        3.04186819e+04],
       [7.10536920e+04, 7.00515222e+00, 7.44590387e+00, 6.48000000e+00,
        2.93592052e+04]])

In [23]:
lr = LinearRegression()
lr.fit(X_train, y_train)

LinearRegression()
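
`LinearRegression` fits its coefficients by ordinary least squares. For intuition, here is a one-feature sketch using the closed-form slope and intercept. The notebook's model has five features, so this is a simplified illustration under assumed names (`fit_simple_ols`) and made-up data, not what scikit-learn runs internally.

```python
def fit_simple_ols(x, y):
    """Least-squares line y = slope * x + intercept for a single feature."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel).
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up points that lie exactly on y = 2x + 1.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_simple_ols(x, y)
print(slope, intercept)
```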

In [24]:
pred = lr.predict(X_test)

In [25]:
from sklearn.metrics import mean_squared_error
from math import sqrt

In [26]:
mean_squared_error(y_test, pred)

10262474848.754446
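
The cell above prints the mean squared error, while the regression metric listed earlier is RMSE. A minimal sketch of the conversion, with the printed MSE hard-coded so the snippet runs on its own:

```python
from math import sqrt

# MSE value copied from the notebook output above.
mse = 10262474848.754446

# RMSE is the square root of MSE, so it is in the same units as Price.
rmse = sqrt(mse)
print(f"RMSE: {rmse:,.2f}")
```

In the notebook itself, `sqrt(mean_squared_error(y_test, pred))` would give the same number. It comes to roughly 101,000 in the units of `Price`, about 8% of the mean price reported by `df.describe()`.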

## Text Classification

In [1]:
import re
import pandas as pd
import pickle as pkl
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [2]:
df = pd.read_csv("./data/review.tsv", sep="\t")
ps = PorterStemmer()
corpus = []

In [3]:
df

| | Review | Liked |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |
| ... | ... | ... |
| 995 | I think food should have flavor and texture an... | 0 |
| 996 | Appetite instantly gone. | 0 |
| 997 | Overall I was not impressed and would not go b... | 0 |
| 998 | The whole experience was underwhelming, and I ... | 0 |


In [4]:
set(stopwords.words("english"))

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [7]:
import nltk
# Only the "stopwords" corpus is used in this notebook; "all" fetches every NLTK package.
nltk.download("all")

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     /Users/alvynabranches/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     /Users/alvynabranches/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /Users/alvynabranches/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /Users/alvynabranches/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /Users/alvynabranches/nltk_data...
[nltk_data]    |   Unzipping grammars/basque_grammars.zip.
[nltk_data]    | Downloadin

False

In [5]:
# Build the stop-word set once instead of once per word.
stop_words = set(stopwords.words("english"))

for review in df["Review"]:
    # Keep letters only, lower-case, and split into tokens.
    tokens = re.sub("[^a-zA-Z]", " ", review).lower().split()
    # Drop stop words and stem what remains.
    clean_review = " ".join(ps.stem(word) for word in tokens if word not in stop_words)
    corpus.append(clean_review)
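
One side effect of this cleaning is that NLTK's English stop-word list contains negations such as "not", so "Crust is not good." becomes "crust good". The sketch below shows the same cleaning steps as a function with an optional set of words to keep. It deliberately skips stemming and uses a tiny hypothetical stop-word list so it runs without NLTK. The names `clean_review`, `STOP_WORDS`, and `NEGATIONS` are assumptions for illustration.

```python
import re

# Tiny hypothetical stop-word list; the notebook uses NLTK's English list.
STOP_WORDS = frozenset({"is", "the", "and", "was", "this", "not", "just"})
NEGATIONS = frozenset({"not", "no", "nor"})

def clean_review(text, stop_words=STOP_WORDS, keep=frozenset()):
    """Lower-case, keep letters only, and drop stop words (no stemming here)."""
    tokens = re.sub("[^a-zA-Z]", " ", text).lower().split()
    # A word survives if it is not a stop word, or if it is explicitly kept.
    return " ".join(w for w in tokens if w not in stop_words or w in keep)

print(clean_review("Crust is not good."))                  # crust good
print(clean_review("Crust is not good.", keep=NEGATIONS))  # crust not good
```

Whether keeping negations helps this particular classifier is untested here. It is a common adjustment for sentiment data, not a result from the notebook.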

In [6]:
corpus

['wow love place',
 'crust good',
 'tasti textur nasti',
 'stop late may bank holiday rick steve recommend love',
 'select menu great price',
 'get angri want damn pho',
 'honeslti tast fresh',
 'potato like rubber could tell made ahead time kept warmer',
 'fri great',
 'great touch',
 'servic prompt',
 'would go back',
 'cashier care ever say still end wayyy overpr',
 'tri cape cod ravoli chicken cranberri mmmm',
 'disgust pretti sure human hair',
 'shock sign indic cash',
 'highli recommend',
 'waitress littl slow servic',
 'place worth time let alon vega',
 'like',
 'burritto blah',
 'food amaz',
 'servic also cute',
 'could care less interior beauti',
 'perform',
 'right red velvet cake ohhh stuff good',
 'never brought salad ask',
 'hole wall great mexican street taco friendli staff',
 'took hour get food tabl restaur food luke warm sever run around like total overwhelm',
 'worst salmon sashimi',
 'also combo like burger fri beer decent deal',
 'like final blow',
 'found place acc

In [9]:
%%time

# Keep at most 1500 terms that occur in at least 3 reviews and in at most 60% of them.
vectorizer = TfidfVectorizer(max_features=1500, min_df=3, max_df=0.6)
X = vectorizer.fit_transform(corpus).toarray().tolist()
y = df["Liked"].values.tolist()

CPU times: user 20.1 ms, sys: 4.81 ms, total: 24.9 ms
Wall time: 23.7 ms
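
To make the vectorizing step less of a black box, here is a pure-Python sketch of the weighting `TfidfVectorizer` applies with its default settings: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row. It ignores `max_features`, `min_df`, and `max_df`, and it splits on whitespace instead of using scikit-learn's tokenizer, so treat it as an approximation under those assumptions. The function name `tfidf` is hypothetical.

```python
from math import log, sqrt

def tfidf(docs):
    """Return (vocabulary, rows) with smoothed IDF and L2-normalized rows."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = {w: sum(1 for doc in tokenized if w in doc) for w in vocab}
    idf = {w: log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for doc in tokenized:
        weights = [doc.count(w) * idf[w] for w in vocab]
        norm = sqrt(sum(v * v for v in weights)) or 1.0  # avoid dividing by zero
        rows.append([v / norm for v in weights])
    return vocab, rows

# Three short cleaned reviews, made up in the style of the corpus above.
docs = ["wow love place", "crust good", "love food"]
vocab, rows = tfidf(docs)
print(vocab)
print(rows[0])
```

In the first row, "love" gets a lower weight than "wow" because it appears in two of the three documents and therefore has a smaller IDF.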


In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)

In [11]:
classifierKNN = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)
classifierKNN.fit(X_train, y_train)

KNeighborsClassifier()
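
With `metric="minkowski"` and `p=2`, the classifier measures Euclidean distance and lets the `n_neighbors` closest training points vote. A minimal brute-force sketch of that idea follows. The name `knn_predict` and the toy points are assumptions, and tie-breaking may differ from scikit-learn.

```python
from collections import Counter
from math import dist

def knn_predict(X_train, y_train, x, k=5):
    """Predict the majority label among the k nearest training points."""
    # Sort training points by Euclidean distance to x and keep the k closest.
    neighbours = sorted(zip(X_train, y_train), key=lambda pair: dist(pair[0], x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Made-up 2-D points forming two well-separated groups.
X_train = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
y_train = [0, 0, 0, 1, 1, 1]

print(knn_predict(X_train, y_train, [0.05, 0.1], k=3))
print(knn_predict(X_train, y_train, [1.0, 0.95], k=3))
```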

In [13]:
print(classification_report(y_test, classifierKNN.predict(X_test)))

              precision    recall  f1-score   support

           0       0.53      0.85      0.65        91
           1       0.74      0.37      0.49       109

    accuracy                           0.58       200
   macro avg       0.63      0.61      0.57       200
weighted avg       0.64      0.58      0.56       200



In [14]:
print(accuracy_score(y_test, classifierKNN.predict(X_test)))
print(confusion_matrix(y_test, classifierKNN.predict(X_test)))

0.585
[[77 14]
 [69 40]]
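
The figures in the classification report can be recomputed from the confusion matrix printed above. scikit-learn puts true classes in rows and predicted classes in columns, so the sketch below unpacks the matrix that way. The variable names are illustrative.

```python
# Confusion matrix copied from the output above.
# Rows: true class 0, 1. Columns: predicted class 0, 1.
cm = [[77, 14],
      [69, 40]]
tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision_1 = tp / (tp + fp)   # how many predicted "liked" were right
recall_1 = tp / (tp + fn)      # how many real "liked" reviews were found
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

print(f"accuracy={accuracy:.3f}")
print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f}")
print(f"class 0: precision={precision_0:.2f} recall={recall_0:.2f}")
```

The low recall for class 1 reflects the 69 positive reviews that the model labelled as negative.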


In [19]:
with open("model.pkl", "wb") as f:
    pkl.dump(classifierKNN, f)
with open("vectorizer.pkl", "wb") as f:
    pkl.dump(vectorizer, f)

In [20]:
with open("model.pkl", "rb") as f:
    model = pkl.load(f)
with open("vectorizer.pkl", "rb") as f:
    vec = pkl.load(f)

In [21]:
print(accuracy_score(y_test, model.predict(X_test)))
print(confusion_matrix(y_test, model.predict(X_test)))

0.585
[[77 14]
 [69 40]]
