# Netflix personal recommendation system - Semi-supervised learning.

The Netflix dataset has roughly 9,000 samples, but only about 200 of them carry my labels, which is too few for a purely supervised classifier. Moreover, once all the features are included (among them the ones that require encoding), the number of features becomes very large relative to the number of labeled samples, which will probably lead to sparse data and overfitting. A different approach is therefore required. Here, I will build a semi-supervised pipeline that uses the abundant unlabeled Netflix samples to create features, which then feed a supervised classifier trained on the labeled data.
For the unsupervised part I considered an association rule mining algorithm, but since several features are continuous (binning was too crude and would also produce sparse data), I decided to take a different approach.
I will use k-means for the unsupervised learning, so feature scaling is required (k-means is a distance-based algorithm).
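
As a rough sketch of the overall idea (on synthetic data, not the Netflix set): scale the features, fit k-means on the unlabeled pool, then describe each labeled sample by its distances to the cluster centres and train a classifier on those distances. The array shapes, the number of clusters, and the choice of `LogisticRegression` are illustrative assumptions, not the final design.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins: a large unlabeled pool and a small labeled set,
# with features on very different scales
X_unlabeled = rng.normal(size=(900, 6)) * [1, 10, 100, 1, 1, 1]
X_labeled = rng.normal(size=(40, 6)) * [1, 10, 100, 1, 1, 1]
y_labeled = (X_labeled[:, 0] > 0).astype(int)

# k-means is distance based, so every feature is brought to a comparable scale first
scaler = StandardScaler().fit(X_unlabeled)
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
kmeans.fit(scaler.transform(X_unlabeled))

# Distances to the cluster centres become low-dimensional features for the classifier
cluster_features = kmeans.transform(scaler.transform(X_labeled))
clf = LogisticRegression(max_iter=1000).fit(cluster_features, y_labeled)
print(cluster_features.shape)
```

The classifier only ever sees one column per cluster, which keeps the feature count small relative to the number of labeled samples.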

## Word2Vec algo:

(Note: The next lesson will be to replace Word2Vec with BERT.)

First, we will create a Word2Vec class so that we can use the Title and, most importantly, the Description features. The reason for wrapping it in a class is to train a separate Word2Vec model for each feature.

In [None]:
import gensim
from gensim.models import Word2Vec
import numpy as np

class Word2Vec_feature_creation():
    def __init__(self, vector_size, train_data, data2embed, window=1):
        self.vec_size = vector_size
        self.training_data = train_data
        self.window = window
        self.data2embed = data2embed
    
    def tokenize(self, data):
        # Lowercase and split every text into a list of tokens
        tokenized_data = [gensim.utils.simple_preprocess(text) for text in data]
        return tokenized_data

    def train(self):
        tokenized_train = self.tokenize(self.training_data)
        return Word2Vec(tokenized_train, vector_size=self.vec_size, window=self.window, sample=1e-5, min_count=1, workers=1, seed=2)

    def word_embedding(self):
        embedded = []
        model = self.train()
        tokenized_data = self.tokenize(self.data2embed)
        for sentence in tokenized_data:
            # Skip words the model never saw during training (avoids a KeyError)
            embeddings = [model.wv[word] for word in sentence if word in model.wv]
            if embeddings:
                sentence_embedding = sum(embeddings) / len(embeddings)  # Average of word embeddings
            else:
                # Zero vector keeps one row per sample, so rows stay aligned with the dataset
                sentence_embedding = np.zeros(self.vec_size, dtype=np.float32)
            embedded.append(sentence_embedding)
        return embedded

Next, we will start visualizing the different features and apply the Word2Vec word embedding to the relevant ones.
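
The sentence-level averaging in `word_embedding` can be illustrated without training a model. The sketch below uses a hypothetical dictionary of toy vectors (`toy_vectors`) and a hypothetical helper (`average_embedding`), and makes the handling of unknown words explicit: they are skipped, and a sentence with no known words gets a zero vector so that there is still one row per sample.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors standing in for a trained model's vocabulary
toy_vectors = {
    "space": np.array([1.0, 0.0, 0.0]),
    "crew": np.array([0.0, 1.0, 0.0]),
    "mission": np.array([0.0, 0.0, 1.0]),
}

def average_embedding(tokens, vectors, dim):
    """Mean of the known word vectors; a zero vector if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

sentences = [["space", "crew"], ["unknown", "words"], ["mission"]]
embedded = np.vstack([average_embedding(s, toy_vectors, 3) for s in sentences])
print(embedded.shape)  # one row per sentence
```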

## Preprocessing:

### General view:

Since the director column has a significant number of missing values, and they cannot be filled with a mean value, we will skip this feature, since using it could introduce a strong bias into the model.
Let's dive deeper into the remaining features:


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df_N = pd.read_csv(r"INSERT FILE LOCATION HERE", skipinitialspace = True, skip_blank_lines=True)
print(df_N.isna().sum())
df_N.drop(['show_id', 'country', 'date_added', 'director'], axis=1, inplace=True)

### Rating:

There are 4 samples with a NaN value in the rating column. Since that is so few relative to the number of samples, we will fill the 4 blanks with the most frequent category. Let's see the categories and their frequency:

In [None]:
rating_data = df_N['rating'].dropna().astype(str)
unique_rating = np.unique(rating_data.dropna())

x, _, _ = plt.hist(rating_data, bins=np.size(unique_rating), range=(-0.5,13.5), histtype='step')
plt.xlabel('Rating')
plt.ylabel('Frequency')
y_line = 6
plt.hlines(y=y_line, xmin=-0.5, xmax=14, color='red', linestyle='--')
plt.text(x=12, y=y_line+1, s='Frequency = ' + str(y_line), fontsize=8, color='red')
plt.xticks(rotation='vertical')
plt.show()

Now, since the most frequent value is 'TV-MA', we can assign this value to all NaN values in the column:

In [None]:
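# Assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas
df_N['rating'] = df_N['rating'].fillna('TV-MA')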
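
The most frequent category can also be looked up programmatically instead of being read off the histogram. A minimal sketch on a toy column (the values below are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'rating' column, with two missing entries
ratings = pd.Series(['TV-MA', 'TV-14', 'TV-MA', np.nan, 'PG', np.nan, 'TV-MA'])

most_frequent = ratings.mode()[0]  # mode() ignores NaN by default
ratings = ratings.fillna(most_frequent)
print(most_frequent, ratings.isna().sum())
```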

Looking a bit closer shows that there are 3 categories ('TV-Y7-FV', 'UR' and 'NC-17') with fewer than 7 occurrences. Since the rating is a categorical feature, creating three more columns (one for each category below the red line) would make the feature matrix very sparse while adding very little information. Thus, we will eliminate the samples that belong to these three categories from the Netflix dataset (df_N). Since we know we have 12 samples to remove, printing the number of rows before and after will ensure that the rows were removed:

In [None]:
drop_inx = df_N[df_N['rating'].isin(['TV-Y7-FV', 'UR', 'NC-17'])].index
print(df_N.shape[0])
df_N.drop(drop_inx, inplace=True)
df_N.reset_index(inplace=True, drop=True)
print(df_N.shape[0])
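
The same filtering can be driven by the frequency threshold rather than by hard-coded category names. A sketch on a toy frame, assuming the threshold of 7 that the red line in the histogram marks (`min_count` is a hypothetical name):

```python
import pandas as pd

# Toy stand-in for df_N; the threshold of 7 mirrors the red line in the histogram
toy = pd.DataFrame({'rating': ['TV-MA'] * 10 + ['TV-14'] * 8 + ['UR'] * 2 + ['NC-17']})
min_count = 7

counts = toy['rating'].value_counts()
rare = counts[counts < min_count].index  # categories below the threshold
filtered = toy[~toy['rating'].isin(rare)].reset_index(drop=True)
print(sorted(rare), len(toy), '->', len(filtered))
```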

Since the rating is a single column, we can use sklearn.preprocessing.OneHotEncoder to encode the categories. We will create a 'features' df to store all the processed features:

In [None]:
# Create an encoder for the 'Rating' feature
from sklearn.preprocessing import OneHotEncoder
# `sparse_output` replaced `sparse` in scikit-learn 1.2; use sparse=False on older versions
encoder = OneHotEncoder(sparse_output=False)
encoded_columns = encoder.fit_transform(df_N['rating'].to_frame())
feature_values = encoder.categories_[0]
column_names = list(feature_values)
features = pd.DataFrame(encoded_columns, columns=column_names)

### Categories (Listed in):

The listed_in column contains multiple values for each sample (up to 3). The first step was to separate them into 3 different columns (listed_in1 to listed_in3). The next step was to examine the data. Since the first column has no NaN values, each sample is associated with at least 1 and up to 3 categories. Let's plot a histogram of all the possible values (without NaNs).

In [None]:
listedIn_dataframe = df_N.loc[:,'listed_in1':'listed_in3'].fillna('')
listedIn_flat = listedIn_dataframe.to_numpy().flatten()
listedIn_flat = np.delete(listedIn_flat, np.where(listedIn_flat == ''))
unique_LI= np.unique(listedIn_flat)

hist = plt.hist(listedIn_flat, bins=np.size(unique_LI), histtype='step')
plt.xlabel('Listed in')
plt.ylabel('Frequency')
y_line=50
plt.hlines(y=y_line, xmin=0, xmax=44, color='red', linestyles='--', linewidth=0.7)
plt.text(x=38, y=y_line+3, s='Frequency = ' + str(y_line), fontsize=8, color='red')
plt.xticks(rotation='vertical')
plt.show()

Now we need to create the features for 'Listed In'. The processed features will be saved in the df named 'features'.

In [None]:
def CategorizeListedIn(dataframe):
    # Multi-hot encoding: one column per category, 1 where the sample lists it
    samplesNum = dataframe.shape[0]
    unique_vals = np.unique(dataframe.to_numpy().flatten())
    unique_vals = unique_vals[unique_vals != '']  # drop the placeholder for missing values
    features = pd.DataFrame(0, columns=unique_vals, index=np.arange(samplesNum))
    for i in range(samplesNum):
        for col in ['listed_in1', 'listed_in2', 'listed_in3']:
            value = dataframe[col][i]
            if value != '':
                # .loc avoids chained assignment, which may silently write to a copy
                features.loc[i, value] = 1
    return features

listedIn_features = CategorizeListedIn(df_N.loc[:,'listed_in1':'listed_in3'].fillna(''))
features = pd.concat([features, listedIn_features], axis=1)
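
For comparison, the same multi-hot encoding can be sketched with scikit-learn's `MultiLabelBinarizer`. This is an alternative shown on a toy frame with made-up categories, not the function used above:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-in for the three 'listed_in' columns ('' marks a missing category)
toy = pd.DataFrame({
    'listed_in1': ['Dramas', 'Comedies', 'Dramas'],
    'listed_in2': ['Comedies', '', 'Thrillers'],
    'listed_in3': ['', '', 'Comedies'],
})

# One list of non-empty categories per sample
labels = [[v for v in row if v != ''] for row in toy.to_numpy()]

mlb = MultiLabelBinarizer()
multi_hot = pd.DataFrame(mlb.fit_transform(labels), columns=mlb.classes_)
print(multi_hot)
```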

### Type:

Since 'Type' has no NaN values and contains two categories, 'Movie' and 'TV Show', we can go ahead and create the categorical features for it:

In [None]:
# Create an encoder for the 'Type' feature
# `sparse_output` replaced `sparse` in scikit-learn 1.2; use sparse=False on older versions
encoder = OneHotEncoder(sparse_output=False)
encoded_columns = encoder.fit_transform(df_N['Type'].to_frame())
feature_values = encoder.categories_[0]
column_names = ['Type_' + value for value in feature_values]
encoded_df = pd.DataFrame(encoded_columns, columns=column_names)
features = pd.concat([features, encoded_df], axis=1)

### Release Year:

Again, all samples have the release year listed. Let's check the histogram of the category:

In [None]:
release_data = df_N['release_year']
unique_release = np.unique(release_data)

# One bin per year, centred on the integer value (the earliest year is included)
plt.hist(release_data.sort_values(), bins=np.arange(release_data.min(), release_data.max() + 1.5) - 0.5, histtype='step')
plt.xlabel('Release year')
plt.ylabel('Frequency')
plt.xticks(rotation='vertical')
plt.show()


As expected, the frequency of the release year increases roughly exponentially as we get closer to the present. This feature needs only scaling in the preprocessing stage. Since the distribution is so skewed, we want to consider renormalizing it: its shape was determined by the development of technology and the increasing popularity of Netflix rather than by some property of each specific title. Nevertheless, the characteristics of a title are directly related to its release year, and this can directly influence its popularity. To take both considerations into account, we will use a QuantileTransformer, which is robust to outliers and maps the feature to a uniform distribution.

In [None]:
from sklearn.preprocessing import QuantileTransformer

QSc = QuantileTransformer(output_distribution='uniform')
feature_values = QSc.fit_transform(release_data.to_frame())
# fit_transform returns an (n, 1) array; flatten it so each cell holds a number, not a list
features['release_year'] = feature_values.ravel()
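
To see what the quantile transform does to a skewed feature, here is a small sketch on synthetic 'release years' (the exponential sample and `n_quantiles=100` are assumptions made for the illustration):

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
# Synthetic 'release years' piled up near the present, mimicking the skew in the histogram
years = 2021 - np.floor(rng.exponential(scale=5.0, size=2000)).astype(int)

qt = QuantileTransformer(n_quantiles=100, output_distribution='uniform', random_state=0)
scaled = qt.fit_transform(years.reshape(-1, 1)).ravel()

# The output lies in [0, 1] and preserves the ordering of the years
print(scaled.min(), scaled.max())
```

Because the mapping is rank based, a handful of very old titles cannot stretch the scale the way they would with min-max scaling.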

### Duration:

The 'Duration' feature has no missing values. A quick check shows that for every sample of type 'Movie' the duration is given in minutes ('min'), and for every sample of type 'TV Show' it is the number of seasons:
(Run if you want to double-check)

In [None]:
odds = 0
for i in range(df_N.shape[0]):
    if df_N['Type'][i] == 'TV Show' and 'min' in df_N['duration'][i]:
        odds += 1
    if df_N['Type'][i] == 'Movie' and 'Season' in df_N['duration'][i]:
        odds += 1
print('number of odds is: ' + str(odds))

Since the difference between 'min' and 'Seasons' is already accounted for by the 'Type' feature, all we need to do here is convert the data from strings to ints and later apply an appropriate scaler. To split the strings into separate columns we will use the pandas split function:

In [None]:
duration_feature = df_N['duration'].str.split(expand=True)[0]
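
If the unit is needed alongside the number, both can be captured in one pass with a regular expression. A sketch on a toy column (the group names `value` and `unit` are hypothetical):

```python
import pandas as pd

# Toy stand-in for the 'duration' column
duration = pd.Series(['90 min', '2 Seasons', '1 Season', '125 min'])

# Capture the leading number and the unit that follows it
parts = duration.str.extract(r'(?P<value>\d+)\s+(?P<unit>\w+)')
parts['value'] = parts['value'].astype(int)
parts['is_movie'] = parts['unit'].eq('min')
print(parts)
```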

Let's plot the distributions of the data separately for movies and TV shows:

In [None]:
movies_duration = duration_feature[df_N[df_N['Type'] == 'Movie'].index].astype(int)
TvShow_duration = duration_feature[df_N[df_N['Type'] == 'TV Show'].index].astype(int)

fig, ax = plt.subplots(2)
fig.subplots_adjust(hspace=0.348)
ax[0].hist(movies_duration.sort_values(), histtype='step', bins=np.arange(0, movies_duration.max() + 1.5) - 0.5)
ax[0].set_ylabel('Frequency')
ax[0].set_xlabel('Minutes')
ax[0].set_title('Movies')
ax[1].hist(TvShow_duration.sort_values(), histtype='step', bins=np.arange(0, TvShow_duration.max() + 1.5) - 0.5)
ax[1].set_ylabel('Frequency')
ax[1].set_xlabel('Seasons')
ax[1].set_xticks(range(0, 18, 1))
ax[1].set_title('TV Shows')
plt.show()

We are going to split the feature into two columns, the duration of movies and the duration of TV shows, since the unit of measurement differs by 'Type' (movies are measured in minutes and TV shows in seasons).
The movie distribution resembles a normal distribution with a mean of ~100 minutes, only slightly skewed on the left side. This feature will be scaled with a standard scaler, which suits normally distributed data.

The number of seasons decays roughly exponentially. This time we can assume that the exponential decrease in the number of seasons is strongly related to the success of the TV show. A min-max scaler is better suited to this distribution. Nevertheless, using two different scalers for the two duration features would give them unequal weight in a distance-based algorithm, since min-max scaling compresses the values into [0, 1] while standard scaling yields unit variance.

We will thus use a standard scaler for both features.

In [1]:
from sklearn.preprocessing import StandardScaler

# One column per type; a sample gets 0 in the column that does not apply to it
features_vals = pd.DataFrame(0, columns=['duration_movies', 'duration_TvShows'], index=range(df_N.shape[0]))
for i in range(df_N.shape[0]):
    val = int(df_N['duration'][i].split()[0])  # numeric part of e.g. '90 min' or '2 Seasons'
    if 'min' in df_N['duration'][i]:
        features_vals.loc[i, 'duration_movies'] = val
    else:
        features_vals.loc[i, 'duration_TvShows'] = val

# Scale each duration column separately; ravel() flattens the (n, 1) output
SSc_movies = StandardScaler()
features_vals['duration_movies'] = SSc_movies.fit_transform(features_vals[['duration_movies']]).ravel()
SSc_series = StandardScaler()
features_vals['duration_TvShows'] = SSc_series.fit_transform(features_vals[['duration_TvShows']]).ravel()

features = pd.concat([features, features_vals], axis=1)


### Title & Description

For the Title and Description features, we will use the Word2Vec class created earlier to extract features from the text.
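
One way the resulting sentence vectors could be attached to the 'features' df is sketched below. The random vectors stand in for the output of `word_embedding()`, and the column names (`description_0`, ...) and `vector_size = 4` are assumptions for illustration. The concatenation only lines up if there is exactly one vector per row.

```python
import numpy as np
import pandas as pd

# Placeholder for the output of word_embedding(): one vector per sample
vector_size = 4
embedded = [np.random.default_rng(i).normal(size=vector_size) for i in range(3)]

# Stack the vectors into a 2-D array and name one column per embedding dimension
desc_columns = [f'description_{k}' for k in range(vector_size)]
description_features = pd.DataFrame(np.vstack(embedded), columns=desc_columns)

features = pd.DataFrame({'Type_Movie': [1, 0, 1]})  # toy stand-in for the 'features' df
features = pd.concat([features, description_features], axis=1)
print(features.shape)
```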