#### TV Shows and Movies listed on Netflix

Link to Kaggle -- https://www.kaggle.com/shivamb/netflix-shows/tasks?taskId=2447

This dataset consists of tv shows and movies available on Netflix as of 2019. The dataset is collected from Flixable which is a third-party Netflix search engine.

In 2018, they released an interesting report which shows that the number of TV shows on Netflix has nearly tripled since 2010. The streaming service’s number of movies has decreased by more than 2,000 titles since 2010, while its number of TV shows has nearly tripled. It will be interesting to explore what all other insights can be obtained from the same dataset.

Integrating this dataset with other external datasets such as IMDB ratings, rotten tomatoes can also provide many interesting findings.
Inspiration

Some of the interesting questions (tasks) which can be performed on this dataset -

    Understanding what content is available in different countries
    Identifying similar content by matching text-based features
    Network analysis of Actors / Directors and find interesting insights
    Is Netflix has increasingly focusing on TV rather than movies in recent years.
Movie Recommendation System

### Task Details :

**Recommendation system is required in subscription-based OTG platforms. **
Recommended engine generally in three types 

    1.content Based recommended engine
    2.collaborative recommender engine and 
    3.hybrid recommended engine
    
Expected Submission

With the help of this particular data set you have to build a recommended engine. And your recommended engine will return maximum 10 movies name if an user search for a particular movie.
Evaluation

Recommended engine must return 5 movie names and maximum it can return 10 movie names if an user search for a particular movie. This recommender engine should not give suggestion in between 1 to 4 and 6 to 10 it have to return 5 movie names for 10 movie names.


#### About the dataset

**netflix_titles.csv:** The csv file contains information about the various movies and the data related to them:

    - Show ID - unique ID of that particular show
    - Type - type of the video - movie, TV Series etc.
    - Title - title of the video
    - Director - director name
    - Cast - cast members
    - Country - country where it was released
    - Data Added - date when it became live on NETFLIX
    - Release Year - year of release
    - Rating - user rating
    - Duration - duration of the movie, TV Series etc.
    - Listed in - Genre information
    - Description - concise plot of the series

In [1]:
import pandas as pd
import src.utils 
import string
import dill

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import sigmoid_kernel

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Dell\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Dell\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
df = pd.read_csv("data/netflix_titles.csv")
df.head(3)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,TV Show,3%,,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,"August 14, 2020",2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
1,s2,Movie,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,"December 23, 2016",2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,s3,Movie,23:59,Gilbert Chan,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,"December 20, 2018",2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."


In [3]:
df.isnull().sum()

show_id            0
type               0
title              0
director        2389
cast             718
country          507
date_added        10
release_year       0
rating             7
duration           0
listed_in          0
description        0
dtype: int64

In [4]:
# Определим к-во уникальных значений для каждого из полей. 
unique_val=df.nunique()
unique_val

show_id         7787
type               2
title           7787
director        4049
cast            6831
country          681
date_added      1565
release_year      73
rating            14
duration         216
listed_in        492
description     7769
dtype: int64

In [5]:
df['type'].value_counts()

Movie      5377
TV Show    2410
Name: type, dtype: int64

In [6]:
# df.iloc[search_moves('Warrior Nun')]

### Строим рекомендации.

#### Функции для предобработки текста вынесены в отдельный модуль. 

In [7]:
from src.utils import clean_text, lemmatization,correct_text,join_collumns

df['description'] = df['description'].fillna("") # заменяем nan на пустые строки. 

Выполним предобработку текста перед тем как обучить TF-IDF 
1. Уберем из текста стоп-слова, 
2. выполним лематизацию текста.

In [8]:
df['description'] = df['description'].apply(lambda x: clean_text(x))

  text = re.sub("[0-9]|[-—.,:;_%©«»?*!@#№$^•·&()]|[+=]|[[]|[]]|[/]|", '', text)


In [9]:
%% time 
df['description'] = df['description'].apply(lambda x: lemmatization(x))

UsageError: Cell magic `%%` not found.


Для того чтобы поиск выполнялся более корректно, дополним описание фильма списком актеров, режисером и установленными тегами.

In [10]:
df['cast'] = df['cast'].apply(lambda x: correct_text(x))
df['director'] = df['director'].apply(lambda x: correct_text(x))
df['listed_in'] = df['listed_in'].apply(lambda x: correct_text(x))
df.head(3)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,TV Show,3%,,"joão miguel, bianca comparato, michel gomes, r...",Brazil,"August 14, 2020",2020,TV-MA,4 Seasons,"international tv shows, tv dramas, tv sci-fi &...",in a future where the elite inhabit an island ...
1,s2,Movie,7:19,jorge michel grau,"demián bichir, héctor bonilla, oscar serrano, ...",Mexico,"December 23, 2016",2016,TV-MA,93 min,"dramas, international movies",after a devastating earthquake hits mexico cit...
2,s3,Movie,23:59,gilbert chan,"tedd chan, stella chung, henley hii, lawrence ...",Singapore,"December 20, 2018",2011,R,78 min,"horror movies, international movies",when an army recruit is found dead his fellow ...


Добавим слова по которым будет определяться сходство фильмов в столбец 'combined'. 

In [11]:
join_collumns(df,'combined',['type','director','cast','country','rating','listed_in','description'])

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[output_collumn][ind] = ','.join(text_list)


In [12]:
df.head(3)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,combined
0,s1,TV Show,3%,,"joão miguel, bianca comparato, michel gomes, r...",Brazil,"August 14, 2020",2020,TV-MA,4 Seasons,"international tv shows, tv dramas, tv sci-fi &...",in a future where the elite inhabit an island ...,"tv show,joão miguel, bianca comparato, michel ..."
1,s2,Movie,7:19,jorge michel grau,"demián bichir, héctor bonilla, oscar serrano, ...",Mexico,"December 23, 2016",2016,TV-MA,93 min,"dramas, international movies",after a devastating earthquake hits mexico cit...,"movie,jorge michel grau,demián bichir, héctor ..."
2,s3,Movie,23:59,gilbert chan,"tedd chan, stella chung, henley hii, lawrence ...",Singapore,"December 20, 2018",2011,R,78 min,"horror movies, international movies",when an army recruit is found dead his fellow ...,"movie,gilbert chan,tedd chan, stella chung, he..."


In [13]:
# При сосщдании токенов бкдем использовать в качестве разделителя ",".   
tfv = TfidfVectorizer(
                      # min_df = 3,
                      # max_features = None,
                      # analyzer = 'word',
                      # ngram_range = (1,2), 
                      token_pattern = "r[^,.]+", 
                      # token_pattern = "r\w+",
                      # stop_words = 'english'
)

In [14]:
# tfv.tokenizer

In [15]:
tfv_matrix = tfv.fit_transform(df['combined'])

In [16]:
sig = sigmoid_kernel(tfv_matrix,tfv_matrix)
# print(sig[1])
indices = pd.Series(df.index,index = df['title']).drop_duplicates()

In [17]:
# Организуем поиск по фильму.
# Будем заголовки фильмов, в которых встречаются слова из поискового запроса.
def search_moves(text_title = ''):
    id_list = []
    text_title = text_title.lower()  
    for ind, title in enumerate(df['title']):
        title_lower = title.lower()
        if (title_lower.find(text_title)>-1):
            id_list.append(title)
    return id_list

# Проверка !!! 
print(search_moves('terminator'))

['Terminator 3: Rise of the Machines', 'Terminator Salvation']


In [19]:
search_moves('sabrina')

["BoJack Horseman Christmas Special: Sabrina's Christmas Wish",
 'Chilling Adventures of Sabrina',
 'Sabrina']

Проверим работу рекомендальеной системы. 

In [20]:
def recommend(title,n):
    idx = indices[title]
    sim_scores = list(enumerate(sig[idx]))
    sim_scores = sorted(sim_scores,key = lambda x:x[1], reverse = True)
    sim_scores = sim_scores[1:n+1]
    movies_indices = [i[0] for i in sim_scores]
    return df['title'].iloc[movies_indices]

In [21]:
recommend('Terminator 3: Rise of the Machines',5)

6800         The Rainmaker
5794              Stardust
7681            Wyatt Earp
5544    Shattered Memories
330              Aftermath
Name: title, dtype: object

In [22]:
recommend('Chilling Adventures of Sabrina',5)

6870                                   The Silence
6876                                    The Sinner
5447    Seal Team Six: The Raid on Osama Bin Laden
6271                                  The Daughter
1367                         Christmas Inheritance
Name: title, dtype: object

In [23]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_sim = cosine_similarity(tfv_matrix, tfv_matrix)
# cosine_sim

In [24]:
def recommend_1(title,n):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores,key = lambda x:x[1], reverse = True)
    sim_scores = sim_scores[1:n+1]
    movies_indices = [i[0] for i in sim_scores]
    return df['title'].iloc[movies_indices] #df.iloc[movies_indices]

# С косинусной мерой - результат тот же самый.  
recommend_1('Chilling Adventures of Sabrina',5)

6870                                   The Silence
6876                                    The Sinner
5447    Seal Team Six: The Raid on Osama Bin Laden
6271                                  The Daughter
1367                         Christmas Inheritance
Name: title, dtype: object

Сохраняем обученную модель.

In [25]:
with open("data/tfidf_netflix.dill", "wb") as f:
    dill.dump(tfv_matrix, f)