# Tackling the Cold Start problem (Support Vector Machines)

    Cold start happens when new users or new items arrive on an e-commerce platform.
    Classic recommender systems such as collaborative filtering assume that each user or item already has some ratings, so that missing ratings can be inferred from similar users or items.
    For new users or items this breaks down, because there is no browse, click or purchase data for them. As a result, we cannot “fill in the blank” using typical matrix factorization techniques.
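
To make the problem concrete, here is a minimal sketch with a made-up ratings table (the user names, item names and the helper `co_rated_items` are hypothetical). A user-to-user similarity can only be computed over items both users have rated, so a brand-new user has nothing to compare against:

```python
# Toy ratings table: user -> {item: rating}. All names are illustrative.
ratings = {
    "user_a": {"item_1": 5, "item_2": 3},
    "user_b": {"item_1": 4, "item_2": 2, "item_3": 5},
    "new_user": {},  # cold start: no interactions recorded yet
}

def co_rated_items(user_x, user_y):
    """Return the items rated by both users, the basis of any user-user similarity."""
    return sorted(set(ratings[user_x]) & set(ratings[user_y]))

print(co_rated_items("user_a", "user_b"))    # items shared by two existing users
print(co_rated_items("new_user", "user_b"))  # empty list: nothing to compare against
```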

### Imports:

    pandas => Used for storing and basic handling of the data.

    numpy => Used for math and logic operations on our data.

    sklearn => Used for machine learning models, preparing the input data and evaluating model performance.

    joblib => Used for data serialization.

    tkinter => Used for making UI in Python.
    
    LabelEncoder => Used for encoding form names as integers.

    collections => Used for counting how often each form is recommended.

In [8]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import accuracy_score
import joblib
import tkinter

#----Association Rules -----
from sklearn.preprocessing import LabelEncoder
import collections


### Data Preprocessing
    
    The data preprocessing function separates the relevant columns from the irrelevant ones and
    transforms them so that they are ready for SVM training.

In [4]:
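def dataPreprocessing(data):
    # Treat missing values as "not selected"
    data = data.fillna(0)

    # Keep only the feature columns, from 'Occurrence: General Liability'
    # through 'Terrorism' (label-based slices include both ends)
    data_green_blue = data.loc[:, 'Occurrence: General Liability':'Terrorism']

    # Return the features as a NumPy array
    return data_green_blue.to_numpy()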

![Output!](images/tabela)
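
One detail worth keeping in mind is that label-based slicing with `.loc` includes both end columns, unlike positional slicing. A small sketch on a made-up table (the column names below are placeholders, not the real dataset columns):

```python
import pandas as pd

# Hypothetical miniature table; only "feat_a" through "feat_c" are features.
toy = pd.DataFrame({
    "id": [1, 2],
    "feat_a": [1, None],
    "feat_b": [0, 1],
    "feat_c": [None, 1],
    "form_x": [1, 0],
})

# Replace missing values with 0, then take every column
# from "feat_a" through "feat_c", both ends included.
features = toy.fillna(0).loc[:, "feat_a":"feat_c"].to_numpy()

print(features.shape)
print(features)
```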

### SVM training
    The SVM training function trains n models, where n is the number of forms.
    The input data for each model consists of positive and negative samples.
    Positive samples are all the "users" (their features) who chose that specific form.
    Negative samples are all the "users" (their features) who did not choose that specific form.
    
    Every classifier is serialized.
    The SVM validation accuracies are saved in a dictionary keyed by model number.
    The accuracy dictionary is also serialized.

In [3]:
def trainSVMmodels(data):
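    # The loop runs once per form column in the table
    index = 42       # position of the first form column
    num = 1          # model number, used in the serialized file name
    iterator = 42    # position of the form column currently being trained on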

    for form in range(data.shape[1] - index):
        data_pos = data[data.iloc[:,iterator] == 1]
        data_neg = data[data.iloc[:,iterator] == 0]

        x_pos = dataPreprocessing(data_pos)
        x_neg = dataPreprocessing(data_neg)
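        # The label strings are kept as-is because predict() matches on them:
        # "Izabrana forma" = "form selected", "Nije izabrana forma" = "form not selected"
        positive_label = "Izabrana forma"
        negative_label = "Nije izabrana forma"

        # Positive samples come first in x, so their labels come first too
        labels = [positive_label] * len(x_pos) + [negative_label] * len(x_neg)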

        x = np.vstack((x_pos, x_neg))
        y = np.array(labels)

        x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

        clf = svm.SVC(kernel='linear')
        clf.fit(x_train, y_train)
        y_train_pred = clf.predict(x_train)
        y_test_pred = clf.predict(x_test)
        print("Train accuracy: ", accuracy_score(y_train, y_train_pred))
        print("Validation accuracy: ", accuracy_score(y_test, y_test_pred), '--->' , num )
        accuracy_dictionary[num] =  accuracy_score(y_test, y_test_pred)
        joblib.dump(clf,'svm_models/clf{}'.format(num))
        iterator += 1
        num += 1
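
The one-classifier-per-form idea can be tried on a tiny synthetic table. This is only a sketch: the feature values, the two form columns and the English labels `"selected"` / `"not selected"` are made up for illustration, and no train/test split is done because the table is so small.

```python
import numpy as np
from sklearn import svm

# Synthetic data: 6 users, 2 binary features, and 2 hypothetical form columns.
features = np.array([[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [0, 0]])
form_choices = {
    "form_a": np.array([1, 1, 1, 0, 0, 0]),  # chosen when the first feature is set
    "form_b": np.array([0, 0, 1, 1, 1, 0]),  # chosen when the second feature is set
}

models = {}
for form_name, chosen in form_choices.items():
    # Positive samples are users who chose the form, negatives are everyone else.
    labels = np.where(chosen == 1, "selected", "not selected")
    clf = svm.SVC(kernel="linear")
    clf.fit(features, labels)
    models[form_name] = clf

# Ask every per-form classifier about one new user.
new_user = np.array([[1, 0]])
for form_name, clf in models.items():
    print(form_name, clf.predict(new_user)[0])
```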

### Predict forms from the given input
    Each model predicts whether its form would be selected, based on the input information.
    From the positive predictions, only the forms whose model has a validation accuracy above 0.75 are returned.
    
    The predicted forms are serialized.


In [11]:
def predict(input_data):
    # scikit-learn expects a 2-D array, so wrap the single feature vector
    x = np.array([input_data])

    # Collect the forms whose classifier predicts "Izabrana forma" ("form selected")
    result_dict = dict()
    for num in range(1, model_n + 1):
        clf = joblib.load('svm_models/clf{}'.format(num))
        if clf.predict(x)[0] == "Izabrana forma":
            result_dict[num] = form_names[num]

    # Keep only the forms whose model has a validation accuracy above 0.75.
    # Looking accuracy up by model number keeps forms whose models share an accuracy.
    for num, form_name in result_dict.items():
        if accuracy_dictionary[num] > 0.75:
            result.append(form_name)

    joblib.dump(result, 'cold_start_output')
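
The filtering step can be illustrated in isolation. In this sketch the model numbers, form names, accuracies and the helper `filter_by_accuracy` are invented; the point is that looking accuracy up by model number keeps every qualifying form, even when two models share the same accuracy.

```python
# Hypothetical values: model number -> form name predicted as selected.
predicted_forms = {1: "FORM A", 2: "FORM B", 3: "FORM C"}
# Hypothetical validation accuracy of each model, keyed by the same model number.
model_accuracy = {1: 0.91, 2: 0.91, 3: 0.60}

def filter_by_accuracy(predicted, accuracy, threshold=0.75):
    """Keep forms whose classifier's validation accuracy exceeds the threshold."""
    return [name for num, name in predicted.items() if accuracy[num] > threshold]

print(filter_by_accuracy(predicted_forms, model_accuracy))
```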
        

## Main :
    The input data has 25 elements, one for each of the 25 claims and limitations that the user can select or leave unselected.
    The Tkinter UI lets the user choose whichever they want.
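
A sketch of how UI selections might be turned into such a vector. The option names and the helper `selections_to_vector` are placeholders (the real 25 claim and limitation names come from the dataset), and only three options are shown to keep it short.

```python
# Placeholder option names, in the same order as the feature columns.
OPTIONS = ["option_1", "option_2", "option_3"]

def selections_to_vector(selected, options=OPTIONS):
    """Return a 0/1 list with a 1 at the position of every selected option."""
    selected = set(selected)
    return [1 if name in selected else 0 for name in options]

print(selections_to_vector(["option_2"]))
```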
     


In [18]:
# Load the sheet that holds the training data
data = pd.read_excel('BC - AI ORIGINAL.xlsx', sheet_name='BC - AI - V1')

# Every column from position 42 onward is a form, so there is one model per form
model_n = data.shape[1] - 42

# Validation accuracy of each model, keyed by model number
accuracy_dictionary = joblib.load('accuracy_dictionary')

# Uncomment to retrain the models and store their accuracies
#trainSVMmodels(data)
#joblib.dump(accuracy_dictionary,'accuracy_dictionary')

forms = data.loc[:,"MJIL 1000 08 10":"MIL 1214 09 17"]
form_names = list(forms.columns)

result = []

input_data = [0,1,0,0,0,1,0,0,0,1,1,1,1,0,0,0,0,0,1,1,0,1,0,0,0]

predict(input_data)
result = joblib.load('cold_start_output')


print("Recommended forms:")
print("--------------")
for r in result:
    print(r)


Recommended forms:
--------------
MEIL 1231 10 13
IL 00 21 09 08
MEIL 1200 10 16
MDGL 1000 01 13
CG 00 01 04 13
ME 037 04 99
MEGL 0008 01 16
MEGL 0219 05 16
MEGL 1361 05 16


 The final output is a list of recommended forms. This is the end of step one, which solves the
 cold start problem.
  
 The next step is predicting more forms, based on which forms the user selected from the list (explicit user interaction), using association rules.
   
   

# Collaborative filtering using Association Rules

         Collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating).
                
         Association rule mining, at a basic level, involves the use of machine learning models to analyze data for patterns, or co-occurrence, in a database. It identifies frequent if-then associations, which are called association rules.
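
As a small illustration of what a rule means, the sketch below computes the support and confidence of one rule over a handful of made-up transactions (the form names and the helpers `support` and `confidence` are placeholders).

```python
# Each transaction is the set of forms chosen together; names are placeholders.
transactions = [
    {"form_a", "form_b", "form_c"},
    {"form_a", "form_b"},
    {"form_a", "form_c"},
    {"form_b", "form_c"},
]

def support(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(antecedent, consequent):
    """Fraction of transactions with the antecedent that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

# Rule: "if form_a was chosen, then form_b was chosen too".
print(support({"form_a", "form_b"}))
print(confidence({"form_a"}, {"form_b"}))
```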



### Making Powersets
        The get_subsets function builds the powerset of a list and sorts it by the number of elements, in descending order.


In [13]:
def Sorting(lst):
    lst2 = sorted(lst, key=len)
    return lst2

def get_subsets(fullset):
    listrep = list(fullset)
    subsets = []
    for i in range(2**len(listrep)):
        subset = []
        for k in range(len(listrep)):
            if i & 1<<k:
                subset.append(listrep[k])
        subsets.append(subset)

    subset = Sorting(subsets)
    subset = list(reversed(subset))

    return subset
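
For comparison, the same idea can be sketched with `itertools.combinations`, generating subsets from the largest size down. This is an alternative formulation rather than the code used above (the name `subsets_largest_first` is made up), and the order of subsets within one size may differ.

```python
from itertools import combinations

def subsets_largest_first(items):
    """Return all subsets of items as lists, longest subsets first."""
    items = list(items)
    result = []
    for size in range(len(items), -1, -1):
        result.extend(list(combo) for combo in combinations(items, size))
    return result

print(subsets_largest_first(["a", "b", "c"]))
```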

### Data Preprocessing
    This function prepares the data as input for FP-Growth, which is used to find patterns.

In [14]:
def prep_data_for_fpgrowth(data):
    input_data = []

    for index, row in data.iterrows():
        # Names of the forms (columns) that are selected in this row
        selected_forms = [name for name, value in row.items() if value == 1]

        # Encode the form names as integers with the fitted LabelEncoder `le`
        input_data.append(le.transform(selected_forms))

    return input_data
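
The shape of the result may be easier to see on a toy example. The sketch below uses plain Python in place of pandas and `LabelEncoder` (an explicit name-to-integer mapping stands in for the encoder), with made-up form names and a hypothetical helper `rows_to_transactions`.

```python
# Placeholder form names; the mapping plays the role of the fitted encoder.
form_columns = ["form_a", "form_b", "form_c"]
encode = {name: code for code, name in enumerate(sorted(form_columns))}

# Each row marks with 1 the forms that were selected together.
rows = [
    {"form_a": 1, "form_b": 0, "form_c": 1},
    {"form_a": 0, "form_b": 1, "form_c": 1},
]

def rows_to_transactions(rows):
    """Turn each 0/1 row into the list of integer codes of its selected forms."""
    return [[encode[name] for name, value in row.items() if value == 1] for row in rows]

print(rows_to_transactions(rows))
```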

## Association Rules Analyzing
        This function analyzes the association rules and returns recommended forms for the given input.
  
  The input is a list of the forms the user selected in step one ("cold_start_output").
  
### Rules mining algorithm:
  
  1. The list is transformed into a powerset.
  2. We look up the first element of the powerset (the one with the most items) among the keys of the association rules dictionary.
  3. If it is found, we save the values for that key.
  4. If not, we do the same with the elements of the powerset that are shorter than the first element.
  5. Repeat until a match is found or the subsets are exhausted.
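
These steps can be sketched on a toy rules dictionary. The rules, the item names and the helper `largest_matching_rules` are hypothetical; the dictionary maps an antecedent tuple to a `(consequent, confidence)` pair, which is an assumption about the rules format based on how `value[0]` is used in the function below.

```python
from itertools import combinations

# Hypothetical rules: antecedent tuple -> (consequent tuple, confidence).
toy_rules = {
    ("a", "b"): (("x",), 0.9),
    ("b", "c"): (("y",), 0.8),
    ("a",): (("z",), 0.7),
}

def largest_matching_rules(selected, rules):
    """Return consequents of the rules matching the largest subsets of `selected`."""
    items = sorted(selected)
    for size in range(len(items), 0, -1):
        # All subsets of the current size are tried before any smaller subset.
        matches = [rules[s][0] for s in combinations(items, size) if s in rules]
        if matches:
            return [list(consequent) for consequent in matches]
    return []

print(largest_matching_rules(["a", "b", "c"], toy_rules))
```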
         

In [15]:
def rules_analyzer_and_predictor(final_input):
    subset = get_subsets(final_input)

    result = []

    s = subset.pop(0)

    s = le.transform(s)
    s = tuple(s)
    input_is_key = rules.get(s)
    if input_is_key:
        value = rules[s]
        output = le.inverse_transform(value[0])
        output = output.tolist()
        result.append(output)
    else:
        num = 0
        for s in subset:
            s = le.transform(s)
            s = tuple(s)
            if num == 0:
                set_len = set_the_length_of_the_current_set(s, result)
                num += 1
            else:
                if set_len == len(s):
                    input_is_key = rules.get(s)
                    if input_is_key:
                        value = rules[s]
                        output = le.inverse_transform(value[0])
                        output = output.tolist()
                        result.append(output)
                        num += 1
                    else:
                        continue
                else:
                    if not result:
                        set_len = set_the_length_of_the_current_set(s, result)

    temp = []
    for res in result:
        for r in res:
            temp.append(r)

    ctr = collections.Counter(temp)

    # Print the forms from most to least frequently recommended
    print("Recommended forms:")
    print("--------------")
    for form, count in ctr.most_common():
        print(form)


In [16]:
def set_the_length_of_the_current_set(itemset, result):
    # Look the itemset up among the rule keys; on a hit, store the decoded forms
    set_len = len(itemset)
    input_is_key = rules.get(itemset)
    if input_is_key:
        value = rules[itemset]
        output = le.inverse_transform(value[0])
        output = output.tolist()
        result.append(output)
    return set_len
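
The counting step at the end of `rules_analyzer_and_predictor` can be tried on its own. The matched lists below are made up; `Counter.most_common()` is one way to order forms by how often they were recommended.

```python
from collections import Counter

# Made-up form lists returned by several matching rules.
matched = [["form_x", "form_y"], ["form_y"], ["form_y", "form_z"]]

# Flatten the lists and count how often each form is recommended.
counts = Counter(form for forms in matched for form in forms)

# most_common() orders by descending count; ties keep first-seen order.
ranked = [form for form, _ in counts.most_common()]
print(ranked)
```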

# Main : 
      First, the Label Encoder is fitted on the names of all forms.
      Then FP-Growth is used to find patterns in all the data regarding form interaction.
      From those patterns, rules are formed using the "generate_association_rules" function.
      The rules are stored as a Python dictionary.
    
      We use the rules_analyzer_and_predictor function to predict more forms from the given input.
    

In [17]:
data = data.loc[:,'MJIL 1000 08 10':'MIL 1214 09 17']

# The form names are the column labels
labels = list(data.columns)



le = LabelEncoder()
le.fit(labels)


input_data = prep_data_for_fpgrowth(data)


#patterns = find_frequent_patterns(input_data,support_threshold = 280)
#rules = fpg.generate_association_rules(patterns,0.7)
#print(len(rules))
#joblib.dump(rules,'serialized_model/rules')


rules = joblib.load('serialized_model/rules')

recommended_forms = joblib.load('cold_start_output')

final_input = ['IL 00 17 11 98', 'MEIL 1200 10 16', 'MEIL 1231 10 13', 'MEGL 0008 01 16', 'MEGL 0219 05 16']

rules_analyzer_and_predictor(final_input)
                    
                    
                    

Recommended forms:
--------------
MDIL 1000 08 11
MDIL 1001 08 11
MEIL 1225 10 11
MJIL 1000 08 10
MPIL 1007 03 14
