# Data Collection

Dataset: [here](https://www.kaggle.com/datasets/satpreetmakhija/netflix-movies-and-tv-shows-2021?resource=download&select=netflixData.csv)

In [34]:
# Import Libraries
import pandas as pd

# Read data
df = pd.read_csv("netflixData.csv")

# Display top data
df.head()

,Show Id,Title,Description,Director,Genres,Cast,Production Country,Release Date,Rating,Duration,Imdb Score,Content Type,Date Added
0,cc1b6ed9-cf9e-4057-8303-34577fb54477,(Un)Well,This docuseries takes a deep dive into the luc...,,Reality TV,,United States,2020.0,TV-MA,1 Season,6.6/10,TV Show,
1,e2ef4e91-fb25-42ab-b485-be8e3b23dedb,#Alive,"As a grisly virus rampages a city, a lone man ...",Cho Il,"Horror Movies, International Movies, Thrillers","Yoo Ah-in, Park Shin-hye",South Korea,2020.0,TV-MA,99 min,6.2/10,Movie,"September 8, 2020"
2,b01b73b7-81f6-47a7-86d8-acb63080d525,#AnneFrank - Parallel Stories,"Through her diary, Anne Frank's story is retol...","Sabina Fedeli, Anna Migotto","Documentaries, International Movies","Helen Mirren, Gengher Gatti",Italy,2019.0,TV-14,95 min,6.4/10,Movie,"July 1, 2020"
3,b6611af0-f53c-4a08-9ffa-9716dc57eb9c,#blackAF,Kenya Barris and his family navigate relations...,,TV Comedies,"Kenya Barris, Rashida Jones, Iman Benson, Genn...",United States,2020.0,TV-MA,1 Season,6.6/10,TV Show,
4,7f2d4170-bab8-4d75-adc2-197f7124c070,#cats_the_mewvie,This pawesome documentary explores how our fel...,Michael Margolis,"Documentaries, International Movies",,Canada,2020.0,TV-14,90 min,5.1/10,Movie,"February 5, 2020"


# Exploratory Data Analysis (EDA)

**Check information about the dataframe**

In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5967 entries, 0 to 5966
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Show Id             5967 non-null   object 
 1   Title               5967 non-null   object 
 2   Description         5967 non-null   object 
 3   Director            3903 non-null   object 
 4   Genres              5967 non-null   object 
 5   Cast                5437 non-null   object 
 6   Production Country  5408 non-null   object 
 7   Release Date        5964 non-null   float64
 8   Rating              5963 non-null   object 
 9   Duration            5964 non-null   object 
 10  Imdb Score          5359 non-null   object 
 11  Content Type        5967 non-null   object 
 12  Date Added          4632 non-null   object 
dtypes: float64(1), object(12)
memory usage: 606.1+ KB


**Check how many null values each column has**

In [36]:
print(df.isnull().sum())

Show Id                  0
Title                    0
Description              0
Director              2064
Genres                   0
Cast                   530
Production Country     559
Release Date             3
Rating                   4
Duration                 3
Imdb Score             608
Content Type             0
Date Added            1335
dtype: int64
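
Raw counts are easier to judge as a share of all rows. From the figures above, `Director` is missing in 2064 of 5967 rows (about 34.6%). The sketch below shows one way to get such percentages with `isnull().mean()`; the `toy` dataframe and the `null_share` name are made up for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature dataframe standing in for the real dataset
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Director": ["X", np.nan, np.nan, "Y"],
    "Rating": ["TV-MA", "TV-14", np.nan, "PG"],
})

# The mean of a boolean mask is the fraction of True values, i.e. the null share
null_share = toy.isnull().mean().mul(100).round(1)
print(null_share)
```

On the real data the same expression would be `df.isnull().mean().mul(100).round(1)`.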


# Data Preparation

**Select the columns to use**

In [37]:
# Select the columns to use into a new dataframe
# (.copy() avoids a SettingWithCopyWarning when df1 is modified later)
df1 = df[["Title", "Description", "Content Type", "Genres"]].copy()

# Use the Title column as the index
df1.set_index('Title', inplace=True)

# Display top data
df1.head()

Title,Description,Content Type,Genres
(Un)Well,This docuseries takes a deep dive into the luc...,TV Show,Reality TV
#Alive,"As a grisly virus rampages a city, a lone man ...",Movie,"Horror Movies, International Movies, Thrillers"
#AnneFrank - Parallel Stories,"Through her diary, Anne Frank's story is retol...",Movie,"Documentaries, International Movies"
#blackAF,Kenya Barris and his family navigate relations...,TV Show,TV Comedies
#cats_the_mewvie,This pawesome documentary explores how our fel...,Movie,"Documentaries, International Movies"


**Remove rows that contain null values**

In [39]:
# Drop rows that contain null values
df1 = df1.dropna()

# Check how many null values remain in each column
print(df1.isnull().sum())

Description     0
Content Type    0
Genres          0
dtype: int64


**String Preparation**

In [40]:
# Import Libraries
import string

# Functions for string preparation
def separate(texts): # Split on ",", then remove spaces inside each item and lowercase
    t = []
    for text in texts.split(','):
        t.append(text.replace(' ', '').lower())
    return ' '.join(t)

def remove_space(texts): # Remove " " and lowercase
    return texts.replace(' ', '').lower()

def remove_punc(texts): # Remove punctuation (". : etc") and lowercase
    return texts.translate(str.maketrans('','',string.punctuation)).lower()

# Run function
df1['Content Type'] = df1['Content Type'].apply(remove_space)
df1['Genres'] = df1['Genres'].apply(separate)
df1['Description'] = df1['Description'].apply(remove_punc)

# Display top data
df1.head()

Title,Description,Content Type,Genres
(Un)Well,this docuseries takes a deep dive into the luc...,tvshow,realitytv
#Alive,as a grisly virus rampages a city a lone man s...,movie,horrormovies internationalmovies thrillers
#AnneFrank - Parallel Stories,through her diary anne franks story is retold ...,movie,documentaries internationalmovies
#blackAF,kenya barris and his family navigate relations...,tvshow,tvcomedies
#cats_the_mewvie,this pawesome documentary explores how our fel...,movie,documentaries internationalmovies
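
The point of gluing words together is that `TfidfVectorizer` splits text on word boundaries by default, so "International Movies" would otherwise become two unrelated tokens. A minimal sketch of two of the helpers applied to single strings taken from the rows above follows; `separate` is rewritten here as a one-line equivalent, and the `genres` and `desc` variable names are illustrative only.

```python
import string

def separate(texts):
    # Split a comma-separated field, remove spaces inside each item, lowercase
    return ' '.join(part.replace(' ', '').lower() for part in texts.split(','))

def remove_punc(texts):
    # Delete every character listed in string.punctuation, then lowercase
    return texts.translate(str.maketrans('', '', string.punctuation)).lower()

genres = separate("Horror Movies, International Movies, Thrillers")
desc = remove_punc("Through her diary, Anne Frank's story is retold.")

print(genres)  # horrormovies internationalmovies thrillers
print(desc)    # through her diary anne franks story is retold
```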


**Combine columns into 1 column** 

In [41]:
# Combine all the words into 1 column, skipping empty values
source_cols = ['Description', 'Content Type', 'Genres']
df1['words'] = df1[source_cols].apply(
    lambda row: ' '.join(value for value in row if value != ''), axis=1
)

# Keep only the combined column
df1 = df1[['words']]

# Display top data
df1.head()

Title,words
(Un)Well,this docuseries takes a deep dive into the luc...
#Alive,as a grisly virus rampages a city a lone man s...
#AnneFrank - Parallel Stories,through her diary anne franks story is retold ...
#blackAF,kenya barris and his family navigate relations...
#cats_the_mewvie,this pawesome documentary explores how our fel...


# Calculate Recommendations

**Use Cosine Similarity**

In [42]:
# Import Libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Calculate TF-IDF
tfid = TfidfVectorizer()
tfid_matrix = tfid.fit_transform(df1['words'])

# Calculate Cosine Similarity
cosine_sim = cosine_similarity(tfid_matrix, tfid_matrix)
cosine_sim

array([[1.        , 0.00583422, 0.00929961, ..., 0.04319535, 0.01663563,
        0.06057268],
       [0.00583422, 1.        , 0.01635644, ..., 0.        , 0.05882956,
        0.00894085],
       [0.00929961, 0.01635644, 1.        , ..., 0.00893794, 0.03183241,
        0.01987033],
       ...,
       [0.04319535, 0.        , 0.00893794, ..., 1.        , 0.08126157,
        0.02393965],
       [0.01663563, 0.05882956, 0.03183241, ..., 0.08126157, 1.        ,
        0.00705325],
       [0.06057268, 0.00894085, 0.01987033, ..., 0.02393965, 0.00705325,
        1.        ]])
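
To see how to read this matrix, here is a small sketch on a made-up three-document corpus. The `toy_docs` strings are invented for illustration and do not come from the dataset. Entry `[i, j]` is the cosine similarity between documents `i` and `j`, so the diagonal is 1, the matrix is symmetric, and documents with no shared terms score 0.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical pre-processed "words" strings
toy_docs = [
    "drug cartel crime drama tvshow",
    "drug cartel mexico crime tvshow",
    "cats documentary movie",
]

# Each row of the TF-IDF matrix is one document, each column one term
toy_matrix = TfidfVectorizer().fit_transform(toy_docs)

# Pairwise cosine similarity of every document against every other
toy_sim = cosine_similarity(toy_matrix)
print(toy_sim.round(2))
```

The first two documents share four terms and score high against each other, while the third shares none with either and scores 0 against both.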

**Code for Recommendation**

In [49]:
# Create new dataframe (copy so the original df is left untouched)
# Its rows line up with cosine_sim because dropna() removed no rows above
df2 = df[['Title', 'Content Type']].copy()

# Create function for recommendation
def recommendation(title, total_result=5, threshold=0.5):
    # Get the index
    idx = df2[df2['Title'] == title].index[0]
    # Attach the similarity scores for this title, drop the title itself, keep the top results
    scored = df2.assign(similarity=cosine_sim[idx])
    df3 = scored.drop(index=idx).sort_values(by='similarity', ascending=False).head(total_result)
    
    # You can set a threshold if you want to narrow the results down
    #df3 = df3[df3['similarity'] > threshold]
    
    # Is the title a movie or tv show?
    movies = df3['Title'][df3['Content Type'] == 'Movie']
    tv_shows = df3['Title'][df3['Content Type'] == 'TV Show']
    
    if len(movies) != 0:
        print('Similar Movie(s) list:')
        for i, movie in enumerate(movies):
            print('{}. {}'.format(i+1, movie))
        print()
    else:
        print('Similar Movie(s) list:')
        print('-\n')
        
    if len(tv_shows) != 0:
        print('Similar TV_show(s) list:')
        for i, tv_show in enumerate(tv_shows):
            print('{}. {}'.format(i+1, tv_show))
    else:
        print('Similar TV_show(s) list:')
        print('-')

**Run Recommendations**

In [51]:
recommendation('Narcos')

Similar Movie(s) list:
-

Similar TV_show(s) list:
1. Narcos: Mexico
2. El Chapo
3. Somos.
4. Top Boy
5. Dealer
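
If the results are needed for further processing rather than display, a variant that returns a dataframe may be more convenient. The sketch below is one possible shape, not the notebook's own function: `recommend_frame`, `toy_titles`, and `toy_sim` are hypothetical names, and the similarity values are made up. It also applies the threshold and raises a clear error for an unknown title.

```python
import numpy as np
import pandas as pd

def recommend_frame(title, titles, sim, total_result=5, threshold=0.0):
    """Return the most similar titles as a dataframe instead of printing them."""
    matches = titles[titles['Title'] == title]
    if matches.empty:
        raise KeyError(f"Title not found: {title!r}")
    label = matches.index[0]
    pos = titles.index.get_loc(label)            # row position in the similarity matrix
    scored = titles.assign(similarity=sim[pos])  # new frame; the input is not modified
    scored = scored.drop(index=label)            # exclude the queried title itself
    scored = scored[scored['similarity'] > threshold]
    return scored.sort_values('similarity', ascending=False).head(total_result)

# Made-up data standing in for df2 and cosine_sim
toy_titles = pd.DataFrame({
    'Title': ['A', 'B', 'C', 'D'],
    'Content Type': ['TV Show', 'TV Show', 'Movie', 'Movie'],
})
toy_sim = np.array([
    [1.0, 0.8, 0.1, 0.4],
    [0.8, 1.0, 0.2, 0.3],
    [0.1, 0.2, 1.0, 0.5],
    [0.4, 0.3, 0.5, 1.0],
])

result = recommend_frame('A', toy_titles, toy_sim, total_result=2, threshold=0.2)
print(result)
```

With the notebook's objects the call would look like `recommend_frame('Narcos', df2, cosine_sim)`, assuming `df2` holds only the `Title` and `Content Type` columns.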
