# Machine Learning

# Naive Bayes

## Cases for practicing ML skills

> [Application check for platform](https://www.cjr.org/analysis/foia-request-how-to-study.php)

> [FOIA Predictor](https://datadotworld.shinyapps.io/foia_shiny_app/)

> [Fake News Challenge](https://github.com/FakeNewsChallenge/fnc-1-baseline/blob/master/feature_engineering.py)



### Importing data

In [107]:
import pandas as pd

df = pd.read_csv("recipes.csv")
df.head()

Unnamed: 0,cuisine,id,ingredient_list
0,greek,10259,"romaine lettuce, black olives, grape tomatoes,..."
1,southern_us,25693,"plain flour, ground pepper, salt, tomatoes, gr..."
2,filipino,20130,"eggs, pepper, salt, mayonaise, cooking oil, gr..."
3,indian,22213,"water, vegetable oil, wheat, salt"
4,indian,13162,"black pepper, shallots, cornflour, cayenne pep..."


### Label column

In [108]:
df['is_italian'] = (df['cuisine'] == 'italian').astype(int)
df.head()

Unnamed: 0,cuisine,id,ingredient_list,is_italian
0,greek,10259,"romaine lettuce, black olives, grape tomatoes,...",0
1,southern_us,25693,"plain flour, ground pepper, salt, tomatoes, gr...",0
2,filipino,20130,"eggs, pepper, salt, mayonaise, cooking oil, gr...",0
3,indian,22213,"water, vegetable oil, wheat, salt",0
4,indian,13162,"black pepper, shallots, cornflour, cayenne pep...",0


In [109]:
df[df.is_italian == 1].head(4)

Unnamed: 0,cuisine,id,ingredient_list,is_italian
7,italian,3735,"sugar, pistachio nuts, white almond bark, flou...",1
9,italian,12734,"chopped tomatoes, fresh basil, garlic, extra-v...",1
10,italian,5875,"pimentos, sweet pepper, dried oregano, olive o...",1
12,italian,2698,"Italian parsley leaves, walnuts, hot red peppe...",1


### Dataframe with features

In [111]:
features_df = pd.DataFrame({
    'has_tomatoes': df.ingredient_list.str.contains('tomato').astype(int),
    'has_olive_oil': df.ingredient_list.str.contains('olive oil').astype(int),
    'has_soy_sauce': df.ingredient_list.str.contains('soy sauce').astype(int)
})
features_df.head(3)

Unnamed: 0,has_tomatoes,has_olive_oil,has_soy_sauce
0,1,0,0
1,1,0,0
2,0,0,1


## Data splitting

In [3]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    features_df.values,  # feature matrix: has_tomatoes, has_olive_oil, has_soy_sauce
    df.is_italian,       # label: 1 = Italian, 0 = not Italian
    test_size=0.2)       # hold out 20% of the rows for testing

## Using classifier

### Import classifier

In [126]:
from sklearn.naive_bayes import BernoulliNB

clf = BernoulliNB()

## A little bit of theory before moving on:
> Understanding the maths behind [Naive Bayes](https://en.wikipedia.org/wiki/Naive_Bayes_classifier)

> Selecting the [model](https://scikit-learn.org/stable/modules/naive_bayes.html) we need to use according to our dataset and features
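
To make the maths concrete, here is a minimal hand-rolled sketch of a Bernoulli Naive Bayes classifier on binary features like the ones built above. It is not scikit-learn's implementation: the function names `fit_bernoulli_nb` and `predict_proba` are hypothetical, the toy rows are made up, and add-one (Laplace) smoothing with `alpha=1.0` is an assumption of the sketch.

```python
import math

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate class priors and smoothed P(feature = 1 | class) from binary rows."""
    model = {}
    n_features = len(X[0])
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = len(rows) / len(X)
        # Smoothing keeps every probability strictly between 0 and 1
        theta = [
            (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_features)
        ]
        model[c] = (prior, theta)
    return model

def predict_proba(model, x):
    """Return the normalized posterior P(class | x) for one binary feature vector."""
    log_scores = {}
    for c, (prior, theta) in model.items():
        log_p = math.log(prior)
        for x_j, t_j in zip(x, theta):
            # "Naive" assumption: features are independent given the class
            log_p += math.log(t_j if x_j == 1 else 1.0 - t_j)
        log_scores[c] = log_p
    top = max(log_scores.values())
    unnormalized = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {c: v / total for c, v in unnormalized.items()}

# Made-up toy data; columns: has_tomatoes, has_olive_oil, has_soy_sauce
X_toy = [[1, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1], [1, 0, 1]]
y_toy = [1, 1, 1, 0, 0, 0]  # 1 = Italian, 0 = not Italian

model = fit_bernoulli_nb(X_toy, y_toy)
probs = predict_proba(model, [1, 1, 0])  # tomatoes and olive oil, no soy sauce
print(probs)
```

On this toy data the recipe with tomatoes and olive oil but no soy sauce gets a much higher posterior for class 1, because each of the three features independently favours that class.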


### Classifier training

In [127]:
clf.fit(X_train, y_train)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

## Testing classifier

In [129]:
clf.score(X_test, y_test)

0.7988686360779385

How does it do on the training data?

In [130]:
clf.score(X_train, y_train)

0.8035764794619566

## ----------------------------------------------------

## Count vectorizer

In [136]:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(token_pattern= r'[a-z][a-z]+', lowercase=False)
counts = vec.fit_transform(df['ingredient_list']) 
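
One thing worth checking with these settings: the token pattern only matches lowercase letters, and `lowercase=False` leaves the text untouched, so a capitalized word loses its first letter. The sketch below reproduces just the tokenizing step with the standard `re` module. The helper `tokenize` is hypothetical, and it assumes the pattern is applied with `findall`, which is my understanding of how `CountVectorizer` uses `token_pattern`.

```python
import re

TOKEN_PATTERN = r'[a-z][a-z]+'  # same pattern as passed to CountVectorizer

def tokenize(text, lowercase):
    """Split text into tokens of two or more lowercase letters."""
    if lowercase:
        text = text.lower()
    return re.findall(TOKEN_PATTERN, text)

# Start of the ingredient list of row 12 in the data shown earlier
sample = "Italian parsley leaves, walnuts"

tokens_raw = tokenize(sample, lowercase=False)
tokens_lower = tokenize(sample, lowercase=True)

print(tokens_raw)    # ['talian', 'parsley', 'leaves', 'walnuts']
print(tokens_lower)  # ['italian', 'parsley', 'leaves', 'walnuts']
```

If tokens such as `talian` are not wanted as features, lowercasing the text before matching is one option. That changes the vocabulary, so scores may differ from the ones recorded in this notebook.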


In [137]:
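# get_feature_names() was removed in scikit-learn 1.2; on older versions use vec.get_feature_names()
feats = pd.DataFrame(counts.toarray(), columns=vec.get_feature_names_out())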

In [139]:
dataset = df.merge(feats, left_index=True, right_index=True)

In [140]:
dataset['is_mexican'] = (dataset['cuisine'] == 'mexican').astype(int)

In [141]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    feats, 
    dataset.is_mexican, 
    test_size=0.2) 

In [142]:
clf.fit(X_train, y_train)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [143]:
clf.score(X_test, y_test)

0.960779384035198

In [144]:
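from sklearn.tree import DecisionTreeClassifier
estimator = DecisionTreeClassifier(max_leaf_nodes=5, random_state=0)  # parameters as shown in the printed repr
estimator.fit(X_train, y_train)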

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=5, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=0, splitter='best')

In [145]:
val_pred = estimator.predict(X_test)

In [146]:
from sklearn.metrics import f1_score

f1_score(y_test, val_pred, average='macro')

0.8076218986614615
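
Macro F1 is not directly comparable to the accuracy returned by `score`: it computes an F1 score per class and averages them with equal weight, so the minority class counts as much as the majority class. A minimal pure-Python sketch of that calculation follows. The helpers `f1_for_class` and `macro_f1` are hypothetical names and the labels are made up for illustration.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 score treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no correct positive predictions
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, label) for label in labels) / len(labels)

# Made-up imbalanced example: 8 negatives, 2 positives
y_true_toy = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

accuracy = sum(1 for t, p in zip(y_true_toy, y_pred_toy) if t == p) / len(y_true_toy)
print(accuracy)                          # 0.8
print(macro_f1(y_true_toy, y_pred_toy))  # 0.6875
```

In the toy example accuracy is 0.8 while macro F1 is 0.6875, because the rare class is only half right and pulls the average down.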

## Dummy classifier as a baseline check

In [147]:
from sklearn.dummy import DummyClassifier

# Baseline that always predicts 0 (not Mexican), whatever the input
cldummy = DummyClassifier(strategy='constant', constant=0)

cldummy.fit(X_train, y_train)


cldummy.score(X_test, y_test)

0.843117536140792
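
The accuracy of a constant predictor is simply the share of test rows that carry that label, so the 0.843 above says that about 84% of the test recipes are not Mexican. Any real classifier has to beat that number to be useful. The sketch below shows the same calculation in plain Python. The helper `constant_baseline_accuracy` is a hypothetical name and the labels are made up.

```python
from collections import Counter

def constant_baseline_accuracy(y_true, constant):
    """Accuracy obtained by predicting `constant` for every row."""
    return sum(1 for t in y_true if t == constant) / len(y_true)

# Made-up labels: 17 negatives and 3 positives
y_toy = [0] * 17 + [1] * 3

# The most frequent label gives the strongest constant baseline
majority_label, majority_count = Counter(y_toy).most_common(1)[0]
baseline = constant_baseline_accuracy(y_toy, majority_label)

print(majority_label)  # 0
print(baseline)        # 0.85
```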

In [1]:
# Finding the mistakes
#test = df.merge(features_df, left_index=True, right_index=True)
#test[test['is_korean'] !=test['has_kimchi']]

## Continue as you wish...