# **Questions (Part 1)**
---




##*__Loading our datasets__*

In [None]:
from google.colab import drive
from os import chdir

drive.mount("/content/drive")
chdir("/content/drive/MyDrive/my_project1")

In [None]:
#Imports
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import re 

%matplotlib inline

In [None]:
netfx_df = pd.read_csv("netflix_titles.csv")
imdb_mov_df = pd.read_csv("imdb-movies.csv", low_memory=False)
imdb_rat_df = pd.read_csv("imdb-ratings.csv")

##*__Data Preprocessing__*

###Firstly we will get familiar with our data

In [None]:
netfx_df.head()

In [None]:
imdb_mov_df.head(2)

In [None]:
print('~~~~~NETFLIX DATASET~~~~~')
print(netfx_df.info())
print()
print('~~~~~IMDB MOVIES DATASET~~~~~')
print(imdb_mov_df.info())
print()
print('~~~~~IMDB RATINGS DATASET~~~~~')
print(imdb_rat_df.info())

###We can see that some columns we will use later on (country, director, cast) have missing values, so we will try to fill them with the corresponding values from the IMDb dataset

In [None]:
# function that fills null values with the values found in the imdb dataframe
def fillNullValues(netflixDf, imdbDf, netflixColumn, imdbColumn):
    # first get the rows that have null values in the column
    nullRows = netflixDf[netflixDf['type'].isin(['Movie']) &\
                        netflixDf[netflixColumn].isnull()].copy()
    # then join the two dataframes by title
    joinedRows = imdbDf.set_index('title').join(nullRows.set_index('title'),
                                                rsuffix='_net',
                                                how='inner')
    
    # change imdb country abbreviations to full names
    joinedRows['country'] = joinedRows['country']\
        .str.replace('USA', 'United States', regex=False)
    joinedRows['country'] = joinedRows['country']\
        .str.replace('UK', 'United Kingdom', regex=False)


    # make sure a row contains info about the same 
    # movie from the imdb and netflix dataset
    joinedRows = joinedRows[(joinedRows['year'].astype(np.int64) == joinedRows['release_year']) &\
                (joinedRows['country'] == joinedRows['country_net'])]


    # add the missing values to the netflix dataframe
    for i,row in enumerate(joinedRows.values):
        # get the missing value
        columnIdx = list(joinedRows.columns).index(imdbColumn)
        columnValue = row[columnIdx]
        # find the location of the missing value in the netflix dataframe
        title = joinedRows.index[i]
        rowIdx = netflixDf[netflixDf['title'] == title].index
        # fill in the missing value
        netflixDf.loc[rowIdx, netflixColumn] = columnValue


In [None]:
fillNullValues(netfx_df, imdb_mov_df, 'director', 'director')
fillNullValues(netfx_df, imdb_mov_df, 'cast', 'actors')
fillNullValues(netfx_df, imdb_mov_df, 'country', 'country')

netfx_df.info()
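The same fill-from-another-table idea can also be sketched with a merge. This is only an illustration on made-up toy frames (the titles, years and director names below are hypothetical, not taken from the datasets), and it matches on title and year only:

```python
import pandas as pd

# toy stand-ins for the Netflix and IMDb frames (hypothetical values)
netflix = pd.DataFrame({
    "title": ["A", "B", "C"],
    "release_year": [2001, 2005, 2010],
    "director": ["Dir A", None, None],
})
imdb = pd.DataFrame({
    "title": ["B", "C", "C"],
    "year": [2005, 1999, 2010],
    "director": ["Dir B", "Old C", "Dir C"],
})

# align on title AND year so that remakes sharing a title are not mixed up
lookup = imdb.rename(columns={"year": "release_year",
                              "director": "director_imdb"})
merged = netflix.merge(lookup, on=["title", "release_year"], how="left")

# keep the Netflix value where present, otherwise fall back to the IMDb value
merged["director"] = merged["director"].fillna(merged["director_imdb"])
filled = merged.drop(columns="director_imdb")
print(filled)
```

A left merge keeps every Netflix row, so rows without an IMDb match simply stay null.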

###Unfortunately we couldn't make any progress, so we will have to drop the rows with missing values, because the data we are working with is mostly categorical and cannot be imputed with a statistic such as the mean

##*__Question \#1__*

>Which type dominates, movies or TV shows?

In [None]:
# get type column
type_ser = netfx_df['type']
# count appearances of each content type on the platform
print(f"Counts per type:\n{type_ser.value_counts()}\n")

# plot the result
def setAxesContent(ax, title, xlabel, ylabel):
    ax.set_title(title)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)

sns.set_theme(context='notebook', style='whitegrid',
              palette='deep', font_scale=1.5)

plt.figure(figsize=(8,6))

ax = sns.countplot(x='type',data=netfx_df)
setAxesContent(ax, 'Most Popular Content', 'Content type', 'Content amount')

plt.show()

##*__Question \#2__*

>In recent years, has Netflix been investing more in movies or in TV shows?

In [None]:
netfx_df.columns

###The column we are interested in is *'date_added'*. We will focus on the last 3 years

In [None]:
# only need the content type and the year it was added to the platform
df = netfx_df.loc[:, ['type','date_added']].copy()
df.dropna(inplace=True)

# convert to supported date data type so that we can perform arithmetic
date_added_ser = df['date_added'].apply(pd.to_datetime)
print(date_added_ser.max())

###From the above output the years we are interested in are 2019, 2020 and 2021

In [None]:
# extract the year from every date
date_added_ser = date_added_ser.dt.year
df['year_added'] = date_added_ser

# drop if year is older than 2019
df = df[df['year_added'] >= 2019]

df.groupby('year_added')['type'].value_counts()

In [None]:
# plot the result
plt.figure(figsize=(12,6))
# tidy dataset for plotting
df.rename(columns={'type':'Content type'}, inplace=True)

ax = sns.countplot(x='year_added', data=df, hue='Content type')
setAxesContent(ax, 'Most popular content in last three years',
               'Year added to the platform', 'Content amount')

plt.show()

###In conclusion, the graph clearly shows that Netflix invests more in movies
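To put a number on a comparison like this, the counts per year can be turned into shares. The following is a minimal sketch on a made-up toy frame (the rows are hypothetical), assuming the same `year_added` and `type` columns as above:

```python
import pandas as pd

# toy data: one row per title added to the platform (hypothetical values)
toy = pd.DataFrame({
    "year_added": [2019, 2019, 2019, 2020, 2020],
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show"],
})

# normalize="index" turns each year's counts into fractions that sum to 1
share = pd.crosstab(toy["year_added"], toy["type"], normalize="index")
print(share)
```

Shares make years with very different totals directly comparable, which raw counts do not.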

##*__Question \#3__*

>Which country has the most content?

In [None]:
netfx_df.columns.array

###The column we are interested in is *'country'*
###Now we need to 'unwrap' the country column since a movie can be produced in multiple countries

In [None]:
# function that locates comma separated values in a given column 
# and splits the corresponding row into multiple rows
# (uses split + explode, since DataFrame.append was removed in pandas 2.0)
def unwrapComSepColumn(df, columnName):
    df = df.copy()
    # turn every value into a list of its comma separated parts;
    # values without a comma become one-element lists
    df[columnName] = df[columnName].str.split(',')
    # create one row per list element, repeating the other columns
    df = df.explode(columnName, ignore_index=True)
    # trim surrounding whitespace and drop empty fragments
    df[columnName] = df[columnName].str.strip()
    df = df[df[columnName] != ""]

    return df
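To make the effect of unwrapping concrete, here is a minimal sketch on a toy frame. The titles and countries are made up, and the snippet is a standalone illustration rather than a call to the function above:

```python
import pandas as pd

# toy frame: the first title lists two countries in one cell (hypothetical values)
toy = pd.DataFrame({
    "title": ["X", "Y"],
    "country": ["United States, India", "France"],
})

# split the cell into a list, then give every list element its own row
long = toy.assign(country=toy["country"].str.split(","))
long = long.explode("country", ignore_index=True)
long["country"] = long["country"].str.strip()
print(long)
```

Title X now appears twice, once per country, so a later `value_counts()` on `country` credits both countries.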

In [None]:
df = netfx_df[['country']].copy()
df.dropna(inplace=True)

df = unwrapComSepColumn(df, 'country')

# find top ten countries with most content
top_ten = df['country'].value_counts().head(10).index.array

# only keep those countries
df = df[df['country'].isin(top_ten)]

In [None]:
print(f"Country with most content:\n{df['country'].value_counts().head(1)}\n")

# plot the result
plt.figure(figsize=(18,10))
plt.xticks(rotation=45)

ax = sns.countplot(x='country', data=df, order = top_ten)
setAxesContent(ax, 'Top Ten countries with most content',
                'Country', 'Content amount')
ax.set_yticks(np.linspace(0,3400,num=18))

plt.show()

###The United States produces the most content

##*__Question \#4__*

>What kind of content does each country have?

###For this question we are going to need columns from netflix and imdb dataframes i.e. the *'country'* and the *'genre'* columns

In [None]:
df_net = netfx_df[['type','title','country','release_year']].copy()
# drop movies that have no country specified
df_net.dropna(inplace=True)

# only keep the movies since the imdb dataset contains movies
df_net = df_net[df_net['type'] == 'Movie']
df_net.drop('type', axis=1, inplace=True)

df_imdb = imdb_mov_df[['original_title','genre', 'year', 'country']].copy()
df_imdb.columns = ['title', 'genre', 'year', 'imdb_country']
# drop movies that have no country specified
df_imdb.dropna(inplace=True)

# change abbreviations to full names
df_imdb['imdb_country'] = df_imdb['imdb_country']\
    .str.replace('USA', 'United States', regex=False)
df_imdb['imdb_country'] = df_imdb['imdb_country']\
    .str.replace('UK', 'United Kingdom', regex=False)

merged = pd.merge(df_net, df_imdb, how='inner', on='title')

# keep movies that have the same release year only
# to avoid movie remakes and sequels
merged['year'] = merged['year'].astype(np.int64)
merged = merged[(merged['year'] == merged['release_year']) &\
                (merged['country'] == merged['imdb_country'])]

# don't need these columns anymore
merged.drop(['year','release_year','imdb_country'], axis=1, inplace=True)

# unwrap country column
merged = unwrapComSepColumn(merged, 'country')

# keep top five countries with most content
top_five_countries = merged['country'].value_counts().head().index.to_list()
merged = merged[merged['country'].isin(top_five_countries)]

merged.head()

In [None]:
# find movie counts per genre in each country group
by_country = merged.groupby('country')

def countGenres(country_gr):
    # keep a dictionary per genre along with the number of movies 
    # of that genre
    gnr_dict = {}
    # get amount of movies per genre for a specific country group
    def getGenres(genre_str):
        nonlocal gnr_dict
        genres = genre_str.split()
        for genre in genres:
            genre = genre.replace(",","")
            if genre not in gnr_dict.keys(): # new genre
                gnr_dict[genre] = 1
            else: # one more movie of the same genre
                gnr_dict[genre] += 1

    country_gr['genre'].apply(getGenres)

    # unwrap the dictionary
    genres, counts = zip(*gnr_dict.items())
    # return a row of all counts per genre
    return pd.Series(counts, index=genres)

df = by_country.apply(countGenres).unstack(level=1)
df.fillna(value=0, inplace=True)
df = df.astype(np.int64)
df
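The per-country genre table can also be sketched with `explode` and `pd.crosstab`. This is a hedged alternative on made-up toy rows, assuming genres are separated by a comma and a space as in the IMDb `genre` column:

```python
import pandas as pd

# toy data: one row per movie with a comma separated genre string (hypothetical)
toy = pd.DataFrame({
    "country": ["India", "India", "France"],
    "genre": ["Drama, Romance", "Drama", "Comedy, Drama"],
})

# one row per (movie, genre) pair
genres = toy.assign(genre=toy["genre"].str.split(", ")).explode("genre")

# rows are countries, columns are genres, cells are movie counts
counts = pd.crosstab(genres["country"], genres["genre"])
print(counts)
```

`crosstab` fills missing country and genre combinations with 0, so no separate `fillna` step is needed.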

In [None]:
# prepare data for plotting
df.reset_index(inplace=True)
tidy = df.melt(id_vars='country')
tidy.sort_values('value', inplace=True, ascending=False)
tidy.rename(columns={'variable':'Genre'}, inplace=True)

plt.figure(figsize=(28,12))

ax = sns.barplot(x='country',y='value', hue='Genre', data=tidy, palette='deep')
setAxesContent(ax, 'Popular genres per country', 'Countries', 'Content Amount')
ax.set_yticks(np.linspace(0,700,num=21))

plt.show()

##*__Question \#5__*

> Prepare charts that show the actors with the most movies in each country. Do the same for TV shows.

###For this question we are going to need the *'type', 'country' and 'cast'* columns from the netflix dataframe

In [None]:
df = netfx_df[['type','country','cast']].copy()
# drop the rows with null values
df.dropna(inplace=True)

# convert concatenated columns to strings 
df['country'] = df['country'].astype(str)
df['cast'] = df['cast'].astype(str)

print(df.info())
print(df.head())

In [None]:
df = unwrapComSepColumn(df, 'country')
# only keep the top ten countries with most content
top_ten_countries = df['country'].value_counts().head(10).index.to_list()
df = df[df['country'].isin(top_ten_countries)]
df = unwrapComSepColumn(df, 'cast')

print(df.info())
print(df.head())

In [None]:
# split movies and series
df_movies = df[df['type'] == 'Movie'].copy()
df_series = df[df['type'] == 'TV Show'].copy()
# don't need the columns anymore
df_movies.drop('type', axis=1, inplace=True)
df_series.drop('type', axis=1, inplace=True)

In [None]:
def getTop10ActorsPerCntr(df):
    # for each country get number of actor appearances
    counts_ser = df.groupby('country')['cast'].value_counts()

    tmp_dict = {}
    # from every country keep top 10 actors with most appearances
    for country in counts_ser.index.get_level_values(0).unique():
        # find the top 10 actors of specific country
        top_10 = counts_ser[country].head(10).values
        # only keep rows that belong to current country
        bools = list(map(lambda x: x == country, counts_ser.index.get_level_values(0)))
        
        tmp_dict.update(counts_ser[counts_ser.isin(top_10) & bools].head(10).to_dict())

    top_10_act = pd.Series(tmp_dict)

    return top_10_act.unstack(level = 1)

In [None]:
# calculate top actor appearances
df_movies = getTop10ActorsPerCntr(df_movies)
df_series = getTop10ActorsPerCntr(df_series)

df_movies.reset_index(inplace=True)
df_series.reset_index(inplace=True)

df_movies.head()

In [None]:

def plotTopActors(df):

    # prepare dataset for plotting
    tidy = df.melt(id_vars='index')
    tidy.dropna(inplace=True)
    tidy.sort_values('value', inplace=True, ascending=False)
    tidy.rename(columns={'variable':'Actor', 'index':'Country'}, inplace=True)

    countries = tidy['Country'].unique()
    num_countries = tidy['Country'].nunique()

    fig, _ = plt.subplots(squeeze=False, figsize=(24,30))
    fig.tight_layout()
    sns.set_theme(context='notebook', style='whitegrid',
                palette='deep', font_scale=1.0)
    
    for row in range(5):
        for col in range(2):
            # get a country
            country_idx = row*2 + col
            if country_idx > num_countries-1:
                break
            ax = plt.subplot2grid((5, 2), (row, col))
            country = countries[country_idx]
            tmp = tidy[tidy['Country'] == country]
            sns.barplot(y='Actor',x='value', ax=ax, data=tmp, palette='deep')
            setAxesContent(ax, f'Popular actors in {country}', 'Appearances', '')
            
    plt.show()


### Plotting for the Movies

In [None]:
plotTopActors(df_movies)

###Plotting for the TV Shows

In [None]:
plotTopActors(df_series)

##*__Question \#6__*

>Netflix claims that it offers rich content for all ages. Is this true? Create a chart that counts the movies by recommended age (we refer to the rating column). The boundaries are defined according to the table below. Make the same chart for TV shows.

###We will only need the *'type' and 'rating'* columns for this question

In [None]:
df = netfx_df[['type', 'rating']].copy()
df.dropna(inplace = True)

In [None]:
# function that turns values from rating column into age categories
def getRatingAges(df):
    bins = {'little_kids': ['G', 'TV-Y', 'TV-G'],
            'older_kids': ['PG', 'TV-Y7', 'TV-Y7-FV', 'TV-PG'],
            'teens': ['PG-13', 'TV-14']
            }
    def getBin(rating):
        nonlocal bins
        if rating in bins['little_kids']:
            return 'Little Kids'
        elif rating in bins['older_kids']:
            return 'Older Kids'
        elif rating in bins['teens']:
            return 'Teens'
        else:
            return 'Mature'

    df['ages'] = df['rating'].apply(getBin)

    return df
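The same binning can be expressed as a plain dictionary lookup. The sketch below reuses the rating groups from the function above on a made-up list of ratings and treats every unlisted rating as 'Mature', as the function does:

```python
import pandas as pd

# rating -> age category, taken from the bins defined above
rating_to_age = {
    "G": "Little Kids", "TV-Y": "Little Kids", "TV-G": "Little Kids",
    "PG": "Older Kids", "TV-Y7": "Older Kids",
    "TV-Y7-FV": "Older Kids", "TV-PG": "Older Kids",
    "PG-13": "Teens", "TV-14": "Teens",
}

# toy ratings (hypothetical sample)
ratings = pd.Series(["TV-Y", "PG", "TV-14", "R", "TV-MA"])

# ratings missing from the dictionary map to NaN and then fall back to 'Mature'
ages = ratings.map(rating_to_age).fillna("Mature")
print(ages.tolist())
```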

In [None]:
df = getRatingAges(df)

df.head()

In [None]:
def plotAges(group):
    if group.name == 'Movie':
        ax = plt.subplot2grid((1, 2), (0, 0))
    else:
        ax = plt.subplot2grid((1, 2), (0, 1))

    sns.countplot(x='ages', data=group, ax=ax)

    if group.name == 'Movie':
        setAxesContent(ax, 'Movies per Age Category','Age Groups', 'Content amount')
    else:
        setAxesContent(ax, 'TV Shows per Age Category', 'Age Groups', 'Content amount')

fig, _ = plt.subplots(squeeze=False, figsize=(14,6))
fig.tight_layout()
sns.set_theme(context='notebook', style='whitegrid',
                palette='deep', font_scale=1.5)

df.groupby('type').apply(plotAges)  
plt.show()

##*__Question \#7__*

>If a producer wanted high viewership, they might think it better to release their movie at a time when there is no competition. Make a chart of the content added per month to help them choose the right time period.

In [None]:


df = netfx_df[['date_added','type']].copy()
df.dropna(inplace=True)
# only keep the movies
df = df[df['type'] == 'Movie']

plt.figure(figsize=(12,6))
plt.xticks(rotation= 45)

df['month'] = df['date_added'].apply(lambda x: re.sub("[^A-Za-z]*","",x))
ax = sns.countplot(x='month', data=df, palette ='deep',\
                   order= df['month'].value_counts().index.to_list())
setAxesContent(ax, 'Content added per month', 'Month', 'Content amount')
ax.set_yticks(np.linspace(0,600,num=11))

plt.show()

###From the above we can see that the producer should publish the movie in February, the month with the fewest movie additions
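The month can also be obtained by parsing the dates instead of stripping non-letters with a regular expression. This is a minimal sketch on made-up dates, assuming the `date_added` strings look like `"September 9, 2019"` and may carry leading whitespace:

```python
import pandas as pd

# toy date strings in the assumed 'Month day, year' layout (hypothetical values)
dates = pd.Series(["September 9, 2019", " January 1, 2020", "January 15, 2021"])

# parse the dates, then take the month name
parsed = pd.to_datetime(dates.str.strip(), format="%B %d, %Y")
months = parsed.dt.month_name()

# sort ascending so the quietest month comes first
per_month = months.value_counts().sort_values()
print(per_month)
```

Parsing first means a malformed date raises an error instead of silently producing an odd month label.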

##*__Question \#8__*

>Prepare a chart that summarizes the content categories (we refer to the listed_in column).

In [None]:
df = netfx_df[['listed_in']].copy()

# the listed_in column has a lot of concatenated values
# unwrap it
df = unwrapComSepColumn(df, 'listed_in')

In [None]:
# plot the results
plt.figure(figsize=(18,14))

sns.set_theme(context='notebook', style='whitegrid', font_scale=1.5)

ax = sns.countplot(y='listed_in',data=df,\
                   order= df['listed_in'].value_counts().index, palette='deep')
setAxesContent(ax, 'Content per category Netflix', 'Content amount', 'Category')
ax.set_xticks(np.linspace(0,2500,num=21))

plt.show()

##*__Question \#9__*

>Study the directors per country and present relevant charts.

In [None]:
# we will need the director and the country column
df = netfx_df[['director', 'country']].copy()
df.dropna(inplace=True)

df = unwrapComSepColumn(df, 'country')
df = unwrapComSepColumn(df, 'director')

print(df.info())
print(df.head())

In [None]:
def plotDirectors(df):
    row, col = 0, 0
    def plotMethod(group):
        nonlocal row, col
        ax = plt.subplot2grid((5, 2), (row, col))

        # find ten directors with most content
        ten_popular = group['director'].value_counts().head(10).index.to_list()
        country = group.name
        group = group[group['director'].isin(ten_popular)]
        
        # plot the result
        sns.countplot(y='director', data=group, ax=ax,\
                      order = group['director'].value_counts().index)
        
        setAxesContent(ax, f'Directors with most movies in {country}','','')
        ax.set_xticks(np.linspace(0,15,num=16))

        # update axes indices for next group
        if col == 1:
            row += 1
            col = 0
        else:
            col += 1


    fig, _ = plt.subplots(squeeze=True, figsize=(24,30))
    fig.tight_layout()
    sns.set_theme(context='notebook', style='whitegrid',
                    palette='deep', font_scale=1.0)

    df.groupby('country').apply(plotMethod)  
    plt.show()

# plot top ten countries
top_ten_countries = df['country'].value_counts().head(10).index.to_list()
df = df[df['country'].isin(top_ten_countries)]
plotDirectors(df)

##*__Question \#10__*

>Study the TV shows and present a chart that shows them according to their number of seasons.

In [None]:
df = netfx_df[['title', 'type', 'duration', 'date_added', 'country']].copy()
df.dropna(inplace=True)
# only keep the TV series
df = df[df['type'] == 'TV Show']

# get amount of seasons in new column
df['seasons'] = df['duration'].apply(lambda x: (np.int64)(re.sub("[^0-9]*","",x)))

df['date_added'] = pd.to_datetime(df['date_added'])
df.head()
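The season count can also be pulled out with a vectorized string method. The snippet is a small sketch on made-up `duration` values, assuming they follow the `"<number> Season(s)"` pattern:

```python
import pandas as pd

# toy duration strings (hypothetical values)
duration = pd.Series(["1 Season", "3 Seasons", "10 Seasons"])

# capture the first run of digits and convert it to an integer
seasons = duration.str.extract(r"(\d+)", expand=False).astype("int64")
print(seasons.tolist())
```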

###Plot all the TV Shows w.r.t their seasons and the year they were added to the platform, in an interactive plot

In [None]:
import altair as alt

df.rename(columns={'title':'Show title', 'country':'Country',
                   'seasons':'Total Seasons','date_added':'Date added'},inplace=True)

alt.Chart(df, width=1000, height=700).mark_circle(size=50).encode(
    x='Date added',
    y='Total Seasons',
    color='Country',
    tooltip=['Show title', 'Country', 'Total Seasons', 'Date added']
).interactive()

##*__Question \#11__*

>Use the remaining files of the assignment to answer the question:
Which movies have the highest ratings? You will need to join
the given files appropriately so as to keep the Netflix movies for which ratings exist on IMDb

In [None]:
#From the imdb movies data frame keep the id and the title, plus 'year' and 'country' for matching the movies later
dfm = imdb_mov_df[['imdb_title_id','title','year','country']].copy()
#From the imdb ratings data frame keep the id, the weighted average vote and the average votes per age group and gender
dfr = imdb_rat_df[['imdb_title_id','weighted_average_vote',\
                   'allgenders_30age_avg_vote','allgenders_18age_avg_vote',\
                   'allgenders_45age_avg_vote','females_allages_avg_vote',\
                   'males_allages_avg_vote']].copy()

#Join the two new data frames on the column that they have in common ('imdb_title_id')
df = dfm.join(dfr.set_index('imdb_title_id'), on='imdb_title_id', how='inner')

#We want to connect this data frame to the netflix data frame, so we keep 'type' and 'title' plus the columns used for matching
dfn = netfx_df[['type','title', 'release_year', 'country']].copy()
#Keep only the movies
dfn = dfn[dfn['type'] == 'Movie']

#Join on 'title' to a final data frame
df = df.join(dfn.set_index('title'), on='title', how='inner', rsuffix='_net')

# Make sure the columns from netflix and imdb datasets refer to the same movie
# change imdb country abbreviations to full names
df['country'] = df['country']\
    .str.replace('USA', 'United States', regex=False)
df['country'] = df['country']\
    .str.replace('UK', 'United Kingdom', regex=False)

df = df[(df['year'].astype(np.int64) == df['release_year'].astype(np.int64)) &\
        (df['country'] == df['country_net'])]

df.sort_values('weighted_average_vote', ascending=False, inplace=True)
df.head()

In [None]:
sns.set_theme(context='notebook', style='whitegrid', font_scale=1.5)
plt.figure(figsize=(15,15))

ax = sns.barplot(y='title',x='weighted_average_vote', data=df.head(25), palette='deep')
ax.set_xticks(np.linspace(0,10,num=21))
setAxesContent(ax, 'Top IMDb-Rating Movies on Netflix', 'Average Rating', 'Netflix Movie')
plt.show()

###Now we will show ratings in different ages and genders

In [None]:
# only keep the columns we are going to need
df.drop(['type','imdb_title_id','year','release_year','weighted_average_vote',\
         'country', 'country_net'], axis=1, inplace=True)
df.rename(columns={'allgenders_18age_avg_vote':'Age 18',
                   'allgenders_30age_avg_vote':'Age 30',
                   'allgenders_45age_avg_vote':'Age 45',
                   'females_allages_avg_vote':'Females',
                   'males_allages_avg_vote':'Males'}, inplace=True)
df = df.melt(id_vars='title')

df.head()

In [None]:
def plotRatings(df):
    row, col = 0, 0
    def plotMethod(group):
        nonlocal row, col
        ax = plt.subplot2grid((3, 2), (row, col))
        ttl = group.name
        
        # sort in order to find top 10 most rated movies
        group = group.sort_values('value', ascending=False)
        
        # plot the result
        sns.barplot(y='title', x='value', palette='deep', data=group.head(10), ax=ax)
        setAxesContent(ax, f'Top rated movies in {ttl}','Rating','')
        ax.set_xticks(np.linspace(0,10,num=21))

        # update axes indices for next group
        if col == 1:
            row += 1
            col = 0
        else:
            col += 1


    fig, _ = plt.subplots(squeeze=True, figsize=(24,15))
    fig.tight_layout()
    sns.set_theme(context='notebook', style='whitegrid', font_scale=1.0)

    df.groupby('variable').apply(plotMethod)  
    plt.show()

# plot ratings per age and gender
plotRatings(df)

# **Part 2 - Recommendation system**
---

## *__Data Preprocessing__*

In [None]:
import nltk

# Keep only the columns we need
df = netfx_df[['show_id', 'title', 'description']].copy()
df.head()

In [None]:
# Download WordNet and define a minimal lemmatizer on top of it
# (same interface as nltk.stem.WordNetLemmatizer)
nltk.download('wordnet')

from nltk.corpus.reader.wordnet import NOUN
from nltk.corpus import wordnet

class WordNetLemmatizer(object):
    def __init__(self):
        pass

    def lemmatize(self, word, pos=NOUN):
        # wordnet._morphy returns the candidate base forms; keep the shortest
        lemmas = wordnet._morphy(word, pos)
        return min(lemmas, key=len) if lemmas else word

    def __repr__(self):
        return "<WordNetLemmatizer>"



# unload wordnet
def teardown_module(module=None):
    from nltk.corpus import wordnet

    wordnet._unload()

###The preprocessing steps we will apply are:
1. Convert all letters to lowercase
2. Remove punctuation
3. Apply lemmatization
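The first two steps can be illustrated without any external dependency. This is a minimal sketch with a hypothetical helper name (`simple_clean`) and a made-up sentence; lemmatization is deliberately left out because it needs the WordNet data:

```python
import re

def simple_clean(text):
    # step 1: lowercase everything
    text = text.lower()
    # step 2: drop every character that is not a word character, hyphen or space
    text = re.sub(r"[^\w\- ]", "", text)
    # collapse repeated whitespace left behind by the removals
    return " ".join(text.split())

cleaned = simple_clean("A well-known Thief, steals SECRETS!")
print(cleaned)
```

Hyphens are kept on purpose so that compound words such as "well-known" survive as a single token.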

In [None]:
# Create WordNetLemmatizer object
lemmatizer = WordNetLemmatizer()

# Text pre-processing function using regular expressions and lemmatizer
def preProc(desc):
    # Make all letters lowercase
    desc = desc.lower()
    # Cleaning the description from special characters i.e. punctuation
    desc = re.sub(r'[^\w\- ]','',desc)

    cleanDesc = []
    for word in desc.split():
        word = lemmatizer.lemmatize(word)
        cleanDesc.append(word)

    return " ".join(cleanDesc)


In [None]:
# Do the cleaning
df['clean_description'] = df['description'].apply(preProc)

In [None]:
df.head(10)

## **1.**

###*a)*

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

desc_list = df['clean_description'].tolist()

# Create the vectorizer for the 'boolean' representation using CountVectorizer
# (note: without binary=True the entries are term counts, not strictly 0/1)
bow_vectorizer = CountVectorizer(max_df=1.0, min_df=1, max_features=1000,
                                 stop_words='english', ngram_range=(1,2))

# 'Boolean' representation of unigrams and bigrams of the descriptions
BoWs = bow_vectorizer.fit_transform(desc_list).toarray()

print(BoWs.shape)

###*b)*

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create the TF-IDF vectorizer using TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(max_df=1.0, min_df=1, max_features=1000,
stop_words='english', ngram_range=(1,2))

# TF-IDF representation of unigrams and bigrams of movies descriptions
tf_idf = tfidf_vectorizer.fit_transform(desc_list).toarray()

print(tf_idf.shape)

## **2.**

### Below we define our metric functions

In [None]:
"""
Similarity finders
"""
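# Define our own jaccard similarity method
def jaccard_coeffient(a,b):
    union = np.maximum(a,b).sum()
    # two all-zero vectors share no terms; return 0 instead of dividing by zero
    return np.minimum(a,b).sum()/union if union else np.float64(0.0)

# Define our own cosine similarity method
def cosine_similarity(a,b):
    norm = np.linalg.norm(a)*np.linalg.norm(b)
    # an all-zero vector has no direction; return 0 instead of dividing by zero
    return np.dot(a,b)/norm if norm else np.float64(0.0)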
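A small worked example may help to see how the two metrics behave. The vectors below are made up; the formulas are the same min/max and dot-product expressions used above:

```python
import numpy as np

# two toy term vectors over a four-word vocabulary (hypothetical values)
a = np.array([1, 1, 0, 1])
b = np.array([1, 0, 0, 1])

# shared terms divided by all terms present in either vector: 2 / 3
jaccard = np.minimum(a, b).sum() / np.maximum(a, b).sum()

# dot product divided by the product of the lengths: 2 / (sqrt(3) * sqrt(2))
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(round(jaccard, 3), round(cosine, 3))
```

For the same pair of vectors the two metrics give different values, so scores from one should not be read on the scale of the other.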

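###For every movie we will keep two heaps, one per metric (jaccard coefficient and cosine similarity), that hold the movies most similar to it. Python's heapq implements min-heaps, so each heap keeps its least similar entry at the root, which is the entry that gets replaced when a better match arrives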

In [None]:
from heapq import heappush, heapreplace

# Function to update each movie's heap
def tryPushToHeap(id_to_push, jac_heap, cos_heap, jac_sim, cos_sim, N_most=100):
    # check jaccard heap
    if len(jac_heap) < N_most: # heap not full, just push the new movie
        heappush(jac_heap, (jac_sim, id_to_push))
    elif jac_heap[0][0] < jac_sim: # new movie is more similar than the least similar one (the heap root)
        heapreplace(jac_heap, (jac_sim, id_to_push))

    # now check cosine heap
    if len(cos_heap) < N_most:
        heappush(cos_heap, (cos_sim, id_to_push))
    elif cos_heap[0][0] < cos_sim: # heap is full, replace the root if the new cosine similarity is larger
        heapreplace(cos_heap, (cos_sim, id_to_push))
        

###The algorithm we implement below iterates through every combination of two movies and calculates the pair's metric scores. The scores will be stored to the movies' heaps if they are better than the existing scores or if the heaps are not yet filled with the N most similar movies.
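The bounded-heap idea can be sketched in isolation. The helper name `push_bounded` and the scores are hypothetical; the point is that a min-heap capped at N entries ends up holding the N largest items:

```python
from heapq import heappush, heapreplace, nlargest

def push_bounded(heap, item, limit):
    if len(heap) < limit:
        # heap not full yet, always keep the item
        heappush(heap, item)
    elif heap[0] < item:
        # heap[0] is the smallest kept item; swap it for the better one
        heapreplace(heap, item)

heap = []
pairs = [(0.2, "a"), (0.9, "b"), (0.5, "c"), (0.7, "d"), (0.1, "e")]
for score, name in pairs:
    push_bounded(heap, (score, name), limit=3)

# read the kept items from most to least similar
top = nlargest(3, heap)
print(top)
```

Each insertion costs O(log N) regardless of how many candidates are scanned, which is why the heap stays small while every pair of movies is visited.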

In [None]:
from heapq import heappush, heapify, nlargest, nsmallest, heapreplace
""" 
This will be a dictionary that contains show_id's as keys 
and two heaps, one for each metric (jaccard, cosine), with the 100 most similar movies along with their scores.
heapq implements min-heaps, so the root of each heap is the least similar of the kept movies.
"""
# Initialize the heaps (an empty list is already a valid heap)
top_100_sim = {show_id: {'jaccard': [], 'cosine': []} for show_id in df['show_id']}

# Look the ids up once instead of indexing the dataframe inside the loops
show_ids = df['show_id'].to_list()

for i in range(len(show_ids)):
    mov_id = show_ids[i] # Find the movie id

    # Find the movie's heaps
    jac_heap = top_100_sim[mov_id]['jaccard']
    cos_heap = top_100_sim[mov_id]['cosine']

    for z in range(i + 1, len(show_ids)):
        new_mov_id = show_ids[z]

        jac_sim = jaccard_coeffient(BoWs[i], BoWs[z]) # Calculate the similarity using jaccard
        cos_sim = cosine_similarity(tf_idf[i], tf_idf[z]) # Calculate the similarity using cosine

        # Find the new movie's heaps
        new_mov_jac_heap = top_100_sim[new_mov_id]['jaccard']
        new_mov_cos_heap = top_100_sim[new_mov_id]['cosine']

        # Updating the two movie heaps simultaneously
        tryPushToHeap(new_mov_id, jac_heap, cos_heap, jac_sim, cos_sim)
        tryPushToHeap(mov_id, new_mov_jac_heap, new_mov_cos_heap, jac_sim, cos_sim)


## **3.**

###Now we simply look up the dictionary created in the previous step, using the key (*'show_id'* column in the dataset) that corresponds to the movie's title, and retrieve the N movies with the highest scores from the two heaps.

In [None]:
def get_similar_movies1(title, N=10, method='boolean'):

    # Using the title, find the movie ID
    movie_id = df[df['title'] == title]['show_id'].values[0]
    if method == 'boolean':
        top = nlargest(N, top_100_sim[movie_id]['jaccard']) # Keep the N most similar movies using Jaccard as metric
    elif method == 'tf-idf':
        top = nlargest(N, top_100_sim[movie_id]['cosine']) # Keep the N most similar movies using Cosine as metric
    
    # Print the results
    print(f"Queried movie's description:\n\
{df[df['title'] == title]['description'].values[0]}\n\nRecommendations:")

    for i, (score, movie_id) in enumerate(top):
            row = df[df['show_id'] == movie_id]
            print(f'~~~~~~{i+1}~~~~~~')
            print(f"Title: {row['title'].values[0]} | Similarity: {(score*100).round(2)}%")
            print(f"Description: {row['description'].values[0]}\n")

In [None]:
get_similar_movies1('Inception', 10, 'tf-idf')

In [None]:
get_similar_movies1('Inception', 10, 'boolean')

## **4.**

###From the movie description we are given, we create a new vector representation using the already fitted vectorizer, and then find the N most similar movies based on that description.

In [None]:
# Wrapper function so that the 'get_similar_movies2' can use the vectorizer and the two different representations
def wrapper(bow_vectorizer, tf_vectorizer, df, BoWs, tf_idf):
    def get_similar_movies2(desc, N=10, method='tf-idf'):
        nonlocal bow_vectorizer, tf_vectorizer, df, BoWs, tf_idf
        
        if method == 'boolean':
            vectorizer = bow_vectorizer
            vecs = BoWs
        else:
            vectorizer = tf_vectorizer
            vecs = tf_idf

        desc_clean = preProc(desc) # Clean the new description
        movie_vec = vectorizer.transform([desc_clean]).toarray() # Vectorize the cleaned description

        heap = []
        heapify(heap)

        # Calculate the similarity of that vector with all the other vectors in the representation (BoW or tf-idf)
        for i in range(df.index.size):

            if method == 'boolean':
                score = jaccard_coeffient(movie_vec[0], vecs[i])
            else:
                score = cosine_similarity(movie_vec[0], vecs[i])
            
            mov_id = df.iloc[i, df.columns.to_list().index('show_id')]
            # Fill the heap of similarities that corresponds to the description given
            heappush(heap, (score, mov_id))

        # Keep the N most similar
        top = nlargest(N, heap)

        # Printing the results
        for i, (score, movie_id) in enumerate(top):
            row = df[df['show_id'] == movie_id]
            print(f'~~~~~~{i+1}~~~~~~')
            print(f"Title: {row['title'].values[0]} | Similarity: {(score*100).round(2)}%")
            print(f"Description: {row['description'].values[0]}\n")

            
    return get_similar_movies2

get_similar_movies2 = wrapper(bow_vectorizer, tfidf_vectorizer, df, BoWs, tf_idf)
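For the cosine case, the per-row loop can also be written as one matrix product. The sketch below uses a hypothetical helper (`top_n_cosine`) and a made-up 4x2 matrix in place of the TF-IDF array, and scores all-zero rows as 0:

```python
import numpy as np

def top_n_cosine(query_vec, matrix, n):
    # product of the row norms with the query norm, one value per row
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vec)
    # dot products for all rows at once; rows with zero norm keep score 0
    scores = np.divide(matrix @ query_vec, norms,
                       out=np.zeros(len(matrix)), where=norms > 0)
    # indices of the n highest scores, best first
    order = np.argsort(-scores)[:n]
    return order, scores[order]

# toy document vectors and query (hypothetical values)
matrix = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 0.0]])
order, scores = top_n_cosine(np.array([1.0, 0.0]), matrix, n=2)
print(order, scores)
```

This trades the explicit heap for a full sort of the scores, which is usually acceptable when a single query is compared against a few thousand rows.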

In [None]:
get_similar_movies2("War between America and Vietnam", N=10, method='tf-idf')

In [None]:
get_similar_movies2("War between America and Vietnam", N=10, method='boolean')

# ***Observations***
As expected, the TF-IDF method gives us better results, with higher similarity percentages.

Specifically, for ‘get_similar_movies1’, the movie most similar to Inception under the boolean method has a similarity of about 20%, whereas under the TF-IDF method even the tenth most similar movie scores higher than that (25.25%).

For ‘get_similar_movies2’, the difference in similarities is even larger. With the tf-idf method the movie most similar to the given description has a similarity of 46.39%, while with the boolean method it reaches only 25%.

It is entirely reasonable that the two implementations return different results, since they differ both in how they vectorize the text and in how they compare the vectors.

With the data we have, we expected better results from the tf-idf method, as mentioned above, and in combination with cosine similarity the results were even better. This happens because, when the data contains repetition, cosine similarity has been shown to be a better way of comparing than Jaccard.

We also observe that, among the N most similar movies, the variation in similarity under the boolean method is very small. For both questions (3, 4) the gap between the most similar movie and the N-th most similar one is on average ~4%. This means that the method is not very fine-grained and cannot easily pick up differences or similarities.
In the same questions, with the tf-idf method the same gap was ~9%, and it will keep growing as N increases.

