# Data Analysis of Movies & TV Shows available on OTT Platform

#### Team Members:
1. Radha Vishwakarma
2. Kshitij Shetty
3. Om Achrekar

<b>Problem Statement:</b> To perform data preprocessing and cleaning on the given data and to conclude some useful insights about the data.

Dataset used: https://www.kaggle.com/datasets/shivamb/netflix-shows

## 3. Recommendation System:

### Content-based recommendation system:

The TF-IDF (Term Frequency-Inverse Document Frequency) score of a word is its frequency within a document, down-weighted by the number of documents in which it occurs. This reduces the weight of words that appear in many plot overviews, and therefore their influence on the final similarity score.
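As a rough illustration of this down-weighting, the sketch below fits `TfidfVectorizer` on three made-up plot lines (they are not taken from the dataset). Within the first plot, "city" also occurs in another plot while "killer" occurs only once in the corpus, so "killer" should receive the larger weight.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical plot lines, for illustration only
toy_docs = [
    "A detective hunts a killer in the city",
    "A chef opens a restaurant in the city",
    "Two strangers fall in love on a train",
]

toy_vectorizer = TfidfVectorizer(stop_words='english')
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each term to its column in the TF-IDF matrix
vocab = toy_vectorizer.vocabulary_
first_doc = toy_matrix[0].toarray().ravel()

weight_city = first_doc[vocab['city']]      # appears in two plots
weight_killer = first_doc[vocab['killer']]  # appears in one plot

print("city:", round(weight_city, 3), "killer:", round(weight_killer, 3))
```

Both words occur once in the first plot, so the difference in weight comes only from the inverse document frequency term.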

In [26]:
import pandas as pd

df = pd.read_csv("/Users/omachrekar/Desktop/My Work/Data Science and Analytics Projects/Netflix EDA/df6.csv")
df

Unnamed: 0.1,Unnamed: 0,show_id,type,title,director,cast,country,release_year,rating,listed_in,description,date_n,runtime,seasons
0,7790,s7791,Movie,Prelude to War,Frank Capra,,United States,1942,TV-14,"Classic Movies, Documentaries",Frank Capra's documentary chronicles the rise ...,2017-03-31,52.0,
1,8205,s8206,Movie,The Battle of Midway,John Ford,"Henry Fonda, Jane Darwell",United States,1942,TV-14,"Classic Movies, Documentaries",Director John Ford captures combat footage of ...,2017-03-31,18.0,
2,8739,s8740,Movie,Why We Fight: The Battle of Russia,"Frank Capra, Anatole Litvak",,United States,1943,TV-PG,Documentaries,This installment of Frank Capra's acclaimed do...,2017-03-31,82.0,
3,8660,s8661,Movie,Undercover: How to Operate Behind Enemy Lines,John Ford,,United States,1943,TV-PG,"Classic Movies, Documentaries",This World War II-era training film dramatizes...,2017-03-31,61.0,
4,8763,s8764,Movie,WWII: Report from the Aleutians,John Huston,,United States,1943,TV-PG,Documentaries,Filmmaker John Huston narrates this Oscar-nomi...,2017-03-31,45.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7956,1260,s1261,Movie,Crazy About Her,Dani de la Orden,"Álvaro Cervantes, Susana Abaitua, Luis Zahera,...",Spain,2021,TV-MA,"Comedies, International Movies, Romantic Movies","After spending a wild night together, Adri dis...",2021-02-26,107.0,
7957,1265,s1266,Movie,The Girl on the Train,Ribhu Dasgupta,"Parineeti Chopra, Aditi Rao Hydari, Kirti Kulh...",India,2021,TV-MA,"International Movies, Thrillers",A troubled divorcée fixates on a seemingly ide...,2021-02-26,120.0,
7958,1269,s1270,Movie,Geez & Ann,Rizki Balki,"Hanggini, Junior Roberts, Roy Sungkono, Shenin...",Indonesia,2021,TV-PG,"Dramas, International Movies, Romantic Movies","After falling for Geez, a heartthrob at school...",2021-02-25,106.0,
7959,1230,s1231,Movie,Aelay,Halitha Shameem,"P. Samuthirakani, Manikandan, Madhumathi Padma...",India,2021,TV-14,"Comedies, Dramas, Independent Movies",A son stoically returns to his village upon hi...,2021-03-05,151.0,


### Content-based filtering on a single metric:

In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [28]:
#removing stopwords
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
df['description'] = df['description'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(df['description'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape

(7961, 17996)

In [29]:
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

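# Compute the cosine similarity matrix.
# TfidfVectorizer L2-normalizes each row by default, so the plain dot product
# returned by linear_kernel equals the cosine similarity here.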
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
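The use of `linear_kernel` for cosine similarity relies on the TF-IDF rows having unit length. A minimal check of that assumption on made-up plot lines (not the dataset) might look like this:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical plot lines, for illustration only
toy_plots = [
    "A detective hunts a killer in the city",
    "A chef opens a restaurant in the city",
    "Two strangers fall in love on a train",
]

toy_tfidf = TfidfVectorizer(stop_words='english').fit_transform(toy_plots)

# Each row should have Euclidean length 1 with the default norm='l2'
row_norms = np.linalg.norm(toy_tfidf.toarray(), axis=1)

dot_products = linear_kernel(toy_tfidf, toy_tfidf)
cosines = cosine_similarity(toy_tfidf, toy_tfidf)

print(np.allclose(row_norms, 1.0))
print(np.allclose(dot_products, cosines))
```

Both checks should print `True`. `linear_kernel` skips the normalization step that `cosine_similarity` performs, which is why it is often preferred for TF-IDF matrices.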

In [30]:
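# Map each title to its row position, keeping the first row when a title repeats
indices = pd.Series(df.index, index=df['title'])
indices = indices[~indices.index.duplicated(keep='first')]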
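A title-to-row lookup like this returns a single position only when the titles are unique. The sketch below, on made-up titles, shows that `Series.drop_duplicates()` compares the values (the row positions) rather than the index labels, and shows one way to keep the first row per title. The names `toy`, `lookup`, and `unique_lookup` are illustrative.

```python
import pandas as pd

# Hypothetical frame with one repeated title
toy = pd.DataFrame({'title': ['Veronica', 'Ready', 'Veronica']})
lookup = pd.Series(toy.index, index=toy['title'])

# drop_duplicates() looks at the values 0, 1, 2, which are all distinct,
# so the repeated title survives
still_duplicated = lookup.drop_duplicates()

# Filtering on the index labels keeps only the first row per title
unique_lookup = lookup[~lookup.index.duplicated(keep='first')]

print(len(still_duplicated), len(unique_lookup))
print(unique_lookup['Veronica'])
```

With a repeated title, `lookup['Veronica']` yields a Series of positions instead of one integer, which would break the row selection in the recommendation function.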

In [31]:
def get_recommendations(title, cosine_sim=cosine_sim):
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 5 most similar movies (position 0 is the movie itself)
    sim_scores = sim_scores[1:6]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 5 most similar movies
    return df['title'].iloc[movie_indices]

In [33]:
get_recommendations('Jab We Met')

335                 Anjaam
2072       Must Be... Love
548      Bollywood Calling
1611                 Ready
1740    Ek Main Aur Ekk Tu
Name: title, dtype: object

Jab We Met: Romantic Comedy <br>
Anjaam: Psychological thriller <br>
Must Be... Love: Romantic Comedy <br>
Bollywood Calling: Comedy, drama <br>
Ready: Romantic Comedy <br>
Ek Main Aur Ekk Tu: Romantic Comedy

Three of the five recommendations are romantic comedies like Jab We Met, so the recommendation using only the plot is approximately 60% correct.

In [34]:
get_recommendations('Veronica')

4191        Good Morning Call
616     Raising Victor Vargas
4101                The Super
5280          Sara's Notebook
3837             Legend Quest
Name: title, dtype: object

Veronica: Horror, Drama <br>
Good Morning Call: Comedy <br>
Raising Victor Vargas: Romance, Drama <br>
The Super: Action, Comedy, Drama <br>
Sara's Notebook: Thriller, Adventure <br>
Legend Quest: Sci-Fi, Action

Here the recommendation using only the plot fails: none of the five recommended titles is a horror title like Veronica.

### Content-based filtering on multiple metrics:

In [39]:
filledna=df.fillna('')
filledna.head(2)

Unnamed: 0.1,Unnamed: 0,show_id,type,title,director,cast,country,release_year,rating,listed_in,description,date_n,runtime,seasons
0,7790,s7791,Movie,Prelude to War,Frank Capra,,United States,1942,TV-14,"Classic Movies, Documentaries",Frank Capra's documentary chronicles the rise ...,2017-03-31,52.0,
1,8205,s8206,Movie,The Battle of Midway,John Ford,"Henry Fonda, Jane Darwell",United States,1942,TV-14,"Classic Movies, Documentaries",Director John Ford captures combat footage of ...,2017-03-31,18.0,


In [40]:
def clean_data(x):
    # Lowercase and remove spaces so that multi-word names become single tokens
    return str.lower(x.replace(" ", ""))
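To see why names are squashed into single tokens, the sketch below runs the default `CountVectorizer` analyzer on two director names from the table above, with and without the same transformation. The helper name `squash` is illustrative and mirrors `clean_data`.

```python
from sklearn.feature_extraction.text import CountVectorizer

def squash(text):
    # Same transformation as clean_data: remove spaces, then lowercase
    return text.replace(" ", "").lower()

# The analyzer applies the vectorizer's lowercasing, tokenization and stop-word removal
analyzer = CountVectorizer(stop_words='english').build_analyzer()

raw_ford = analyzer("John Ford")
raw_huston = analyzer("John Huston")
squashed_ford = analyzer(squash("John Ford"))
squashed_huston = analyzer(squash("John Huston"))

# Without squashing, the two directors share the token 'john'
shared_raw = set(raw_ford) & set(raw_huston)
shared_squashed = set(squashed_ford) & set(squashed_huston)
print(shared_raw, shared_squashed)

# A free-text sentence collapses into a single long token
sentence_tokens = analyzer(squash("A son returns to his village"))
print(sentence_tokens)
```

Squashing keeps unrelated people who share a first name from looking similar. The last line also shows that the same transformation turns a free-text sentence into one long token, so a field such as `description` contributes little word-level overlap once it has been cleaned this way.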

In [41]:
features=['title','director','cast','listed_in','description']
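# Copy the selected columns so that later assignments do not trigger SettingWithCopyWarning
filledna = filledna[features].copy()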

In [42]:
for feature in features:
    filledna[feature] = filledna[feature].apply(clean_data)
    
filledna.head()

Unnamed: 0,title,director,cast,listed_in,description
0,preludetowar,frankcapra,,"classicmovies,documentaries",frankcapra'sdocumentarychroniclestheriseofauth...
1,thebattleofmidway,johnford,"henryfonda,janedarwell","classicmovies,documentaries",directorjohnfordcapturescombatfootageofthebatt...
2,whywefight:thebattleofrussia,"frankcapra,anatolelitvak",,documentaries,thisinstallmentoffrankcapra'sacclaimeddocument...
3,undercover:howtooperatebehindenemylines,johnford,,"classicmovies,documentaries",thisworldwarii-eratrainingfilmdramatizeshowsec...
4,wwii:reportfromthealeutians,johnhuston,,documentaries,filmmakerjohnhustonnarratesthisoscar-nominated...


In [43]:
def create_soup(x):
    return x['title'] + ' ' + x['director'] + ' ' + x['cast'] + ' ' + x['listed_in'] + ' ' + x['description']

In [44]:
filledna['soup'] = filledna.apply(create_soup, axis=1)

In [45]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(filledna['soup'])

cosine_sim2 = cosine_similarity(count_matrix, count_matrix)
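Unlike TF-IDF rows, raw count vectors are not scaled to unit length, which is presumably why `cosine_similarity` is used here instead of `linear_kernel`. A small sketch on made-up "soup" strings (assembled from names in the table above, not actual rows) shows the difference:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical soups, for illustration only
toy_soups = [
    "johnford henryfonda classicmovies documentaries",
    "johnford classicmovies documentaries",
    "frankcapra documentaries",
]

toy_counts = CountVectorizer(stop_words='english').fit_transform(toy_soups)

# Plain dot products grow with the number of tokens in each soup
raw_dots = linear_kernel(toy_counts, toy_counts)

# Cosine similarity divides by the vector lengths, so self-similarity is 1
cosines = cosine_similarity(toy_counts, toy_counts)

print(np.diag(raw_dots))
print(np.round(cosines, 3))
```

With cosine similarity, a title with a long cast list does not score higher merely because its soup contains more tokens.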

In [46]:
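filledna = filledna.reset_index()
# Map each cleaned title to its row position, keeping the first row when a title repeats
indices = pd.Series(filledna.index, index=filledna['title'])
indices = indices[~indices.index.duplicated(keep='first')]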

In [51]:
def get_recommendations_new(title, cosine_sim=cosine_sim2):
    # The lookup uses cleaned titles, so normalize the query the same way as clean_data
    title = title.replace(' ', '').lower()
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 5 most similar movies
    sim_scores = sim_scores[1:6]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 5 most similar movies
    return df['title'].iloc[movie_indices]

In [52]:
get_recommendations_new('Jab We Met', cosine_sim2)

1108    Golmaal Returns
2313      The Shaukeens
3585            Görümce
637          Ishq Vishk
666              Khushi
Name: title, dtype: object

Jab We Met: Romantic Comedy <br>
Golmaal Returns: Comedy <br>
The Shaukeens: Comedy <br>
Görümce: Romantic Comedy <br>
Ishq Vishk: Romantic Comedy <br>
Khushi: Romantic Comedy

Three of the five recommendations are romantic comedies, so the recommendation using multiple metrics is again approximately 60% correct.

In [53]:
get_recommendations_new('Veronica', cosine_sim2)

3949                                           Kanika
3795    I Am the Pretty Thing That Lives in the House
7373                                      The Binding
7523                                           #Alive
7596                                       HOMUNCULUS
Name: title, dtype: object

Veronica: Horror, Drama <br>
Kanika: Horror <br>
I Am the Pretty Thing That Lives in the House: Horror <br>
The Binding: Horror <br>
#Alive: Drama, Thriller <br>
HOMUNCULUS: Psychological Horror

Using multiple metrics improves the recommendations for Veronica: four of the five titles (80%) are horror, compared with none when using the plot alone. For Jab We Met the rate stays at approximately 60%.
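The percentages quoted in this section can be reproduced from the hand-assigned genre labels listed under each result. The sketch below assumes one particular matching rule, which is not stated in the analysis itself: a recommendation counts as correct when its label contains the query's primary genre ("Romantic Comedy" for Jab We Met, "Horror" for Veronica). The function name `hit_rate` is illustrative.

```python
def hit_rate(primary_genre, recommended_labels):
    """Fraction of recommendations whose label contains the primary genre."""
    wanted = primary_genre.lower()
    hits = [wanted in label.lower() for label in recommended_labels]
    return sum(hits) / len(hits)

# Genre labels copied from the results above (spelling normalized)
plot_only = {
    'Jab We Met': ('Romantic Comedy', ['Psychological thriller', 'Romantic Comedy',
                                       'Comedy, drama', 'Romantic Comedy', 'Romantic Comedy']),
    'Veronica': ('Horror', ['Comedy', 'Romance, Drama', 'Action, Comedy, Drama',
                            'Thriller, Adventure', 'Sci-Fi, Action']),
}
multiple_metrics = {
    'Jab We Met': ('Romantic Comedy', ['Comedy', 'Comedy', 'Romantic Comedy',
                                       'Romantic Comedy', 'Romantic Comedy']),
    'Veronica': ('Horror', ['Horror', 'Horror', 'Horror',
                            'Drama, Thriller', 'Psychological Horror']),
}

for name, results in [('plot only', plot_only), ('multiple metrics', multiple_metrics)]:
    for title, (genre, labels) in results.items():
        print(name, title, hit_rate(genre, labels))
```

Two queries with five recommendations each make for a very small sample, so these rates are best read as a spot check rather than a measured accuracy.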