# Math and Stats Solving Cases

Math and statistics are essential for data science. We will look at a few examples of how they are applied in data science work. We will not cover every math and stats concept; instead, we focus on cosine similarity and on feature selection using hypothesis testing.

## Recommender System using Cosine Similarity - Linear Algebra Implementation

A recommender system is an information filtering system that selects items based on the similarity of users and/or items. There are two kinds of information filtering in recommender systems: `Content-based filtering` and `Collaborative filtering` (we'll discuss them in detail in Phase 2).


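We'll build a simple content-based recommender system using cosine similarity, which we use to measure the similarity between two items. Conceptually, we recommend items to a user based on how similar they are to items the user liked previously.

As a reminder, cosine similarity measures how similar two vectors are through the cosine of the angle between them. For vectors with non-negative entries, like the ones we build here, a value of 1 means the vectors point in the same direction and a value of 0 means they have nothing in common.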

<img src="https://softscients.com/wp-content/uploads/2020/03/2.-Cara-Menghitung-Cosine-similarity.png"></img>


<img src="https://www.researchgate.net/profile/Said-Salloum/publication/345471138/figure/fig2/AS:955431962808321@1604804139868/Cosine-similarity-formula.png"></img>
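Written out in symbols, for two vectors $A$ and $B$ with $n$ components each (the symbol names are chosen here for illustration), the formula in the images reads:

$$
\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
= \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}
$$

The numerator is the dot product of the two vectors and the denominator is the product of their Euclidean norms.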

To start, we use Pandas for data loading and preprocessing and NumPy for the linear algebra calculations. In this case, we will build a movie recommendation system.

In [None]:
import pandas as pd
import numpy as np

In [None]:
movie = pd.read_csv('https://github.com/MahnoorJaved98/Movie-Recommendation-System/blob/main/movie_dataset.csv?raw=true').dropna()
movie

Unnamed: 0,index,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,cast,crew,director
0,0,237000000,Action Adventure Fantasy Science Fiction,http://www.avatarmovie.com/,19995,culture clash future space war space colony so...,en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,Sam Worthington Zoe Saldana Sigourney Weaver S...,"[{'name': 'Stephen E. Rivkin', 'gender': 0, 'd...",James Cameron
1,1,300000000,Adventure Fantasy Action,http://disney.go.com/disneypictures/pirates/,285,ocean drug abuse exotic island east india trad...,en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Johnny Depp Orlando Bloom Keira Knightley Stel...,"[{'name': 'Dariusz Wolski', 'gender': 2, 'depa...",Gore Verbinski
2,2,245000000,Action Adventure Crime,http://www.sonypictures.com/movies/spectre/,206647,spy based on novel secret agent sequel mi6,en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,...,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,Daniel Craig Christoph Waltz L\u00e9a Seydoux ...,"[{'name': 'Thomas Newman', 'gender': 2, 'depar...",Sam Mendes
3,3,250000000,Action Crime Drama Thriller,http://www.thedarkknightrises.com/,49026,dc comics crime fighter terrorist secret ident...,en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.312950,...,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,Christian Bale Michael Caine Gary Oldman Anne ...,"[{'name': 'Hans Zimmer', 'gender': 2, 'departm...",Christopher Nolan
4,4,260000000,Action Adventure Science Fiction,http://movies.disney.com/john-carter,49529,based on novel mars medallion space travel pri...,en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,...,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,Taylor Kitsch Lynn Collins Samantha Morton Wil...,"[{'name': 'Andrew Stanton', 'gender': 2, 'depa...",Andrew Stanton
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4772,4772,31192,Drama Action Comedy,http://downterrace.blogspot.com/,42151,murder dark comedy crime family,en,Down Terrace,After serving jail time for a mysterious crime...,1.330379,...,89.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,You're only as good as the people you know.,Down Terrace,6.3,26,Robert Hill Robin Hill Julia Deakin David Scha...,"[{'name': 'Ben Wheatley', 'gender': 2, 'depart...",Ben Wheatley
4773,4773,27000,Comedy,http://www.miramax.com/movie/clerks/,2292,salesclerk loser aftercreditsstinger,en,Clerks,Convenience and video store clerks Dante and R...,19.748658,...,92.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Just because they serve you doesn't mean they ...,Clerks,7.4,755,Brian O'Halloran Jeff Anderson Jason Mewes Kev...,"[{'name': 'Kevin Smith', 'gender': 2, 'departm...",Kevin Smith
4781,4781,22000,Comedy Romance,https://www.facebook.com/DrySpellMovie,255266,dating divorce sex scene sex comedy anti roman...,en,Dry Spell,Sasha tries to get her soon-to-be ex husband K...,0.048948,...,90.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Getting divorced does funny things to a girl,Dry Spell,6.0,1,Suzi Lorraine Jared Degado Heather Dorff Racha...,"[{'name': 'Travis Legge', 'gender': 0, 'depart...",Travis Legge
4791,4791,13,Horror,http://tincanmanthemovie.com/,157185,home invasion,en,Tin Can Man,Recently dumped by his girlfirend for another ...,0.332679,...,84.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Everything You've Heard is True,Tin Can Man,2.0,1,Michael Parle Emma Eliza Regan Patrick O'Donne...,"[{'name': 'Ivan Kavanagh', 'gender': 0, 'depar...",Ivan Kavanagh


To keep the system simple, we use only the genres as vector elements. Remember that cosine similarity operates on vectors, so we have to extract a vector from the genre data.

To extract the vector, we use one-hot encoding (a technique that marks the presence of a category with 1 and its absence with 0), as illustrated in the image below:

<img src="https://i.imgur.com/mtimFxh.png"></img>

Since each movie can have more than one genre, we have to do some extra preprocessing.
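As a small illustration of what the encoded result looks like when one row carries several genres, here is a sketch on made-up data. The titles `Movie A` to `Movie C` and the variable names are hypothetical; `Series.str.get_dummies` is a pandas method that splits each string on a separator and creates one 0/1 column per token.

```python
import pandas as pd

# Hypothetical toy data: each value is a space-separated genre string
toy_genres = pd.Series(
    ["Action Adventure", "Horror", "Action Comedy"],
    index=["Movie A", "Movie B", "Movie C"],
)

# One column per genre token, 1 if the movie has that token and 0 otherwise
toy_encoded = toy_genres.str.get_dummies(sep=" ")
print(toy_encoded)
```

Because the split is on spaces, a two-word genre such as `Science Fiction` would end up as two separate columns, which is also what happens in the manual process in this notebook.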

### One-Hot Encoding Process

In [None]:
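# Join all genre strings into one long space-separated string
all_genres = ' '.join(movie['genres'])

# split() without an argument drops empty strings, so no slicing is needed.
# Note: 'Science Fiction' and 'TV Movie' become two tokens each.
genres = list(set(all_genres.split()))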

In [None]:
genres

['Foreign',
 'Fiction',
 'Drama',
 'War',
 'Thriller',
 'Action',
 'Animation',
 'TV',
 'Romance',
 'Horror',
 'Fantasy',
 'Mystery',
 'Crime',
 'Comedy',
 'Science',
 'Adventure',
 'Family',
 'Music',
 'Movie',
 'Western',
 'Documentary',
 'History']

In [None]:
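# One list per genre; each list collects a 0/1 flag for every movie
gen_mv = [[] for _ in range(len(genres))]

for dat in movie['genres']:
  tokens = dat.split(' ')  # genre tokens of the current movie
  for i,g in enumerate(genres):
    # 1 if the movie has this genre token, 0 otherwise
    gen_mv[i].append(1 if g in tokens else 0)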

In [None]:
gen_mv_dat = pd.DataFrame(np.array(gen_mv).T,columns=genres)
gen_mv_dat

Unnamed: 0,Foreign,Fiction,Drama,War,Thriller,Action,Animation,TV,Romance,Horror,...,Crime,Comedy,Science,Adventure,Family,Music,Movie,Western,Documentary,History
0,0,1,0,0,0,1,0,0,0,0,...,0,0,1,1,0,0,0,0,0,0
1,0,0,0,0,0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
2,0,0,0,0,0,1,0,0,0,0,...,1,0,0,1,0,0,0,0,0,0
3,0,0,1,0,1,1,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
4,0,1,0,0,0,1,0,0,0,0,...,0,0,1,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1427,0,0,1,0,0,1,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
1428,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
1429,0,0,0,0,0,0,0,0,1,0,...,0,1,0,0,0,0,0,0,0,0
1430,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [None]:
title_df = movie[['original_title']].reset_index(drop=True)
movie_vector = pd.concat([title_df,gen_mv_dat],axis=1)
movie_vector.set_index('original_title',inplace=True)
movie_vector


original_title,Foreign,Fiction,Drama,War,Thriller,Action,Animation,TV,Romance,Horror,...,Crime,Comedy,Science,Adventure,Family,Music,Movie,Western,Documentary,History
Avatar,0,1,0,0,0,1,0,0,0,0,...,0,0,1,1,0,0,0,0,0,0
Pirates of the Caribbean: At World's End,0,0,0,0,0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
Spectre,0,0,0,0,0,1,0,0,0,0,...,1,0,0,1,0,0,0,0,0,0
The Dark Knight Rises,0,0,1,0,1,1,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
John Carter,0,1,0,0,0,1,0,0,0,0,...,0,0,1,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Down Terrace,0,0,1,0,0,1,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
Clerks,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
Dry Spell,0,0,0,0,0,0,0,0,1,0,...,0,1,0,0,0,0,0,0,0,0
Tin Can Man,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


Voila! We now have the movie vectors, which represent each movie's genres. Next, we define a cosine similarity function to simplify the similarity calculation.

In [None]:
def cosine_sim(vect1,vect2):
  norm_1 = np.linalg.norm(vect1)
  norm_2 = np.linalg.norm(vect2)

  cos_sim = (vect1 @ vect2) / (norm_1 * norm_2)
  return cos_sim

Let's test the function on Avatar and Tin Can Man, two movies that have completely different genres.

In [None]:
movie[movie['original_title']=='Avatar'][['original_title','genres']]

Unnamed: 0,original_title,genres
0,Avatar,Action Adventure Fantasy Science Fiction


In [None]:
movie[movie['original_title']=='Tin Can Man'][['original_title','genres']]

Unnamed: 0,original_title,genres
4791,Tin Can Man,Horror


In [None]:
cosine_sim(movie_vector.loc['Avatar'], movie_vector.loc['Tin Can Man'])

0.0

The cosine similarity of the two movies is zero, which means there is no similarity between them.
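To get a feel for values between 0 and 1, here is a minimal sketch on hand-made vectors. The function name `cosine_sim_safe` and the toy vectors are assumptions for illustration; the only change from the function above is a guard for all-zero vectors, where the angle is undefined and this sketch returns 0.0 by convention.

```python
import numpy as np

def cosine_sim_safe(vect1, vect2):
    """Cosine similarity that returns 0.0 when either vector is all zeros."""
    norm_1 = np.linalg.norm(vect1)
    norm_2 = np.linalg.norm(vect2)
    if norm_1 == 0 or norm_2 == 0:
        return 0.0  # convention chosen for this sketch
    return float(np.dot(vect1, vect2) / (norm_1 * norm_2))

# Toy genre vectors over three hypothetical genres: [Action, Adventure, Horror]
action_adventure = np.array([1, 1, 0])
action_only = np.array([1, 0, 0])
horror_only = np.array([0, 0, 1])
no_genre = np.array([0, 0, 0])

sim_partial = cosine_sim_safe(action_adventure, action_only)     # 1 / sqrt(2), one shared genre
sim_disjoint = cosine_sim_safe(action_adventure, horror_only)    # 0, no shared genre
sim_same = cosine_sim_safe(action_adventure, action_adventure)   # approximately 1
sim_empty = cosine_sim_safe(action_adventure, no_genre)          # 0.0 by the guard

print(sim_partial, sim_disjoint, sim_same, sim_empty)
```

The more genres two movies share relative to how many genres each one has, the closer the value gets to 1.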

Imagine that you really love `Man of Steel` and you want our system to recommend 5 movies that are similar to it.

In [None]:
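def recsys(title, top_N):
  # 'title' avoids shadowing the global 'movie' DataFrame
  target = movie_vector.loc[title]
  # Similarity between the chosen movie and every movie in the catalogue
  cossim = pd.Series([cosine_sim(target,x) for x in movie_vector.values],index=movie_vector.index)
  # Drop the chosen movie itself so it is not recommended back
  cossim = cossim.drop(index=title)
  print(f'You like {title}, so based on our recommender system, We recommend you to watch:')
  for i,mv in enumerate(cossim.sort_values(ascending=False)[:top_N].index):
    print(f'{i+1}. {mv}')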

In [None]:
recsys('Man of Steel',5)

You like Man of Steel, so based on our recommender system, We recommend you to watch:
1. Avatar
2. Jupiter Ascending
3. The Wolverine
4. X-Men: Days of Future Past
5. Teenage Mutant Ninja Turtles
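The recommender above computes one similarity per loop iteration. A possible vectorized variant, sketched on made-up data, computes all similarities with a single matrix-vector product. The names `toy_vectors` and `top_similar` are hypothetical, and the sketch assumes no movie has an all-zero genre vector (that would cause a division by zero).

```python
import numpy as np
import pandas as pd

# Hypothetical toy catalogue: rows are movies, columns are genre flags
toy_vectors = pd.DataFrame(
    [[1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 0, 1]],
    index=["Movie A", "Movie B", "Movie C", "Movie D"],
    columns=["Action", "Adventure", "Comedy", "Horror"],
)

def top_similar(vectors, title, top_n):
    """Return the top_n most similar titles to `title` with their scores."""
    matrix = vectors.to_numpy(dtype=float)
    query = vectors.loc[title].to_numpy(dtype=float)
    # Dot product of every row with the query, divided by the norm products
    norms = np.linalg.norm(matrix, axis=1)
    scores = matrix @ query / (norms * np.linalg.norm(query))
    ranked = pd.Series(scores, index=vectors.index).drop(index=title)
    # A stable sort keeps tied movies in their original catalogue order
    return ranked.sort_values(ascending=False, kind="stable").head(top_n)

recommendations = top_similar(toy_vectors, "Movie A", 2)
print(recommendations)
```

With genre-only vectors, many movies can share exactly the same vector and therefore tie on similarity, so the order among the top results depends on how ties are broken.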


# Feature Selection using Hypothesis Testing

Feature selection is essential for machine learning modelling: we need to decide which columns of our data will be used as model input. Many methods exist for this step; here we select features using hypothesis testing. The idea is similar to correlation: we want to know which features are related to the label.

So why can hypothesis testing be used here? Recall that we use hypothesis testing to check whether groups of data differ significantly. This is very useful for classification, where it is otherwise hard to tell whether a feature is related to the label.

For example, suppose we want to classify whether a patient has severe COVID-19 based on the patient's medical report: height, weight, oxygen level, body temperature, etc. We pick one variable, say height, and test whether it differs significantly between patients with and without severe COVID-19. If there is a significant difference, the height variable carries a pattern that can help the model. Otherwise, we find no pattern and can exclude height from the data.

In this lesson, we apply hypothesis testing for feature selection to travel insurance data. The task is to classify whether a person's insurance claim is accepted or not (this information is in the `Claim` column).

We pick two columns to test: distribution channel and duration.

In [1]:
import pandas as pd
from scipy import stats

In [2]:
df = pd.read_csv('https://github.com/fahmimnalfrzki/Dataset/blob/main/travel%20insurance.csv?raw=true')
df

Unnamed: 0,Agency,Agency Type,Distribution Channel,Product Name,Claim,Duration,Destination,Net Sales,Commision (in value),Gender,Age
0,CBH,Travel Agency,Offline,Comprehensive Plan,No,186,MALAYSIA,-29.0,9.57,F,81
1,CBH,Travel Agency,Offline,Comprehensive Plan,No,186,MALAYSIA,-29.0,9.57,F,71
2,CWT,Travel Agency,Online,Rental Vehicle Excess Insurance,No,65,AUSTRALIA,-49.5,29.70,,32
3,CWT,Travel Agency,Online,Rental Vehicle Excess Insurance,No,60,AUSTRALIA,-39.6,23.76,,32
4,CWT,Travel Agency,Online,Rental Vehicle Excess Insurance,No,79,ITALY,-19.8,11.88,,41
...,...,...,...,...,...,...,...,...,...,...,...
63321,JZI,Airlines,Online,Basic Plan,No,111,JAPAN,35.0,12.25,M,31
63322,JZI,Airlines,Online,Basic Plan,No,58,CHINA,40.0,14.00,F,40
63323,JZI,Airlines,Online,Basic Plan,No,2,MALAYSIA,18.0,6.30,M,57
63324,JZI,Airlines,Online,Basic Plan,No,3,VIET NAM,18.0,6.30,M,63


## Distribution channel

We want to test whether `Distribution Channel` differs significantly between accepted and unaccepted claims. Before we go further, we need to check the data type to determine the test method.



In [4]:
df['Distribution Channel'].head()

0    Offline
1    Offline
2     Online
3     Online
4     Online
Name: Distribution Channel, dtype: object

This column is categorical and contains only Offline and Online. We also know that `Claim` is categorical, with the values Yes and No. To test the relationship between two categorical variables, we use the `chi-squared test`. Before going further, let's define our hypotheses:

**H0**: There is no relation between `Distribution Channel` and `Claim`

**H1**: There is a relation between `Distribution Channel` and `Claim`

In [5]:
contingency_table=pd.crosstab(df["Distribution Channel"],df["Claim"])
contingency_table

Distribution Channel \ Claim,No,Yes
Offline,1090,17
Online,61309,910


In [6]:
stat, p, dof, expected = stats.chi2_contingency(contingency_table)

print(f'P-value: {p}')

P-value: 0.9406016138343163


Since the p-value is greater than 0.05, we fail to reject H0: we find no evidence of a relation between distribution channel and the label (claim status). So, we can drop this column from the features.
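To see why the p-value is so high, it helps to compare the observed counts with the counts we would expect if the two variables were independent. The sketch below rebuilds the contingency table by hand from the counts shown above (the variable names are chosen here for illustration) and computes the expected counts as row total times column total divided by the grand total.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the contingency table above
# rows: Offline, Online; columns: Claim = No, Claim = Yes
observed = np.array([[1090, 17],
                     [61309, 910]])

# Expected count per cell under independence
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_by_hand = row_totals * col_totals / observed.sum()

# chi2_contingency returns the same expected table as its fourth value
stat, p, dof, expected = stats.chi2_contingency(observed)

print(np.round(expected_by_hand, 2))
print(np.allclose(expected, expected_by_hand))
print(f'Degrees of freedom: {dof}')
```

The observed counts are almost identical to the expected ones, which is exactly what independence between the two columns looks like.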

## Duration

Next, we want to test whether the travel duration is related to claim status. Is there any difference in duration between accepted and unaccepted claims?

Travel duration is numerical, while claim status has only two categories (Yes and No). So we compare the travel duration of the 'Yes' group with that of the 'No' group using an `independent two-sample t-test`. First, we need to separate the duration data into 'Yes' and 'No'.

In [7]:
duration_yes = df['Duration'][df['Claim']=='Yes']
duration_no = df['Duration'][df['Claim']=='No']

Our hypotheses:

**H0**: There is no significant difference in travel duration between accepted and unaccepted claims

**H1**: There is a significant difference in travel duration between accepted and unaccepted claims

In [8]:
t_stat, p_val = stats.ttest_ind(duration_yes,duration_no)

print(f'P-value: {p_val}')

P-value: 8.522480803701236e-77


Since the p-value is far below 0.05, we reject H0 and conclude that there is a significant difference in travel duration between the 'Yes' and 'No' claim groups. So, we keep the travel duration in our features list.
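One caveat: by default `stats.ttest_ind` assumes both groups have equal variance. The two claim groups here are very different in size (927 'Yes' rows against 62399 'No' rows in the contingency table above), so Welch's t-test, which drops that assumption, is a common alternative. The sketch below uses synthetic data, not the insurance dataset, and the group names and parameters are made up for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic groups with unequal sizes and unequal spread (illustration only)
rng = np.random.default_rng(0)
group_small = rng.normal(loc=70, scale=30, size=200)
group_large = rng.normal(loc=50, scale=10, size=5000)

# Student's t-test: assumes equal variances (the default)
t_student, p_student = stats.ttest_ind(group_small, group_large)

# Welch's t-test: does not assume equal variances
t_welch, p_welch = stats.ttest_ind(group_small, group_large, equal_var=False)

print(f'Student p-value: {p_student}')
print(f'Welch p-value:   {p_welch}')
```

On the insurance data, the same idea would be `stats.ttest_ind(duration_yes, duration_no, equal_var=False)`.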