# 6. Classification
## Project implementation requirements: classification

---
**TASK:**

- Carry out a classification task on your data using an algorithm of your choice.
- First split the data to avoid overfitting when training the algorithm and when selecting parameters. Explain the chosen strategy and the split proportions.
- Select suitable features and set the algorithm's parameters. Describe the procedure you chose for selecting features and parameters. Report the parameter space and the final parameters chosen. State the performance on the training data (or on the development data, if used).

In [1]:
# Imports used below

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_predict
import pandas as pd

apps = pd.read_csv("Daten/Google-Playstore_Edit2.csv")

In [2]:
apps.head()

Unnamed: 0,App Name,App Id,Category,Rating,Rating Count,Installs,Minimum Installs,Maximum Installs,Free,Price,...,Developer Email,Released,Last Updated,Content Rating,Privacy Policy,Ad Supported,In App Purchases,Editors Choice,Scraped Time,Released Year
0,BBD Store - Best Budget Deals,com.app.BBD_Store,Shopping,5.0,20.0,100+,100.0,179,True,0.0,...,bbdstore9137@gmail.com,2020-12-31,2021-05-27,Everyone,https://bbd-store-0.flycricket.io/privacy.html,True,False,False,2021-06-16 00:05:22,2020.0
1,아가펫츠,com.suholdings.agapets,Lifestyle,5.0,8.0,100+,100.0,152,True,0.0,...,info@agapets.com,2020-06-16,2020-08-06,Teen,http://www.agapets.com/terms/terms_privacy.html,True,False,False,2021-06-16 02:17:56,2020.0
2,Zákamenné,com.impoinfo.zakamenne,Communication,5.0,8.0,"1,000+",1000.0,1338,True,0.0,...,adsupra.com@gmail.com,2020-06-25,2020-01-20,Everyone,http://www.impoinfo.com/policy_sk.html,False,False,False,2021-06-16 05:12:07,2020.0
3,Parihans social app,com.wParihansVideocallingandmessenger_11521310,Social,5.0,22.0,100+,100.0,145,True,0.0,...,parihans2931@gmail.com,2020-07-10,2020-12-02,Teen,https://sites.google.com/view/parihans-vlog/pr...,False,False,False,2021-06-16 05:39:39,2020.0
4,CHATRIS 2019 - Match 3 Puzzle Game,com.Doosin.CHATRIS1,Puzzle,5.0,13.0,50+,50.0,69,True,0.0,...,doosin97@gmail.com,2019-09-11,2019-09-12,Everyone,https://blog.naver.com/hd9172/221612868565,True,True,False,2021-06-16 02:36:45,2019.0


In [3]:
# Changed the column order of the DataFrame
apps = apps[['App Name', 'App Id', 'Rating', 'Rating Count', 'Installs', 'Minimum Installs', 'Maximum Installs', 'Free', 
             'Price', 'Currency', 'Size', 'Minimum Android', 'Developer Id', 'Developer Website', 'Developer Email', 
             'Released', 'Last Updated', 'Content Rating', 'Privacy Policy', 'Ad Supported', 'In App Purchases', 'Editors Choice', 'Scraped Time', 'Category']]

In [4]:
# Drop all columns with unique values: encoding them would require too much compute
apps_v2 = apps.copy()
apps_v2.drop(columns=['App Name', 'App Id', 'Developer Id', 'Developer Website', 'Developer Email', 'Privacy Policy', 'Released', 'Scraped Time', 'Last Updated'], inplace=True)

In [5]:
apps_v2['Free'] = apps_v2['Free'].astype(str).copy()
apps_v2['Ad Supported'] = apps_v2['Ad Supported'].astype(str).copy()
apps_v2['Editors Choice'] = apps_v2['Editors Choice'].astype(str).copy()
apps_v2['In App Purchases'] = apps_v2['In App Purchases'].astype(str).copy()

In [6]:
# Prepare the data and split it into test and training data

dataset1 = apps_v2.dropna().loc[:,['Rating','Rating Count', 'Size', 'Minimum Installs', 'Maximum Installs', 'Price']] #.copy()

dataset2 = apps_v2.dropna().loc[:, 'Rating':'Editors Choice'] #.copy()
#list = dataset2.select_dtypes(include=["object"])
#X_onehot = pd.get_dummies(list)

In [7]:
apps_v2.select_dtypes(include=["object"])

Unnamed: 0,Installs,Free,Currency,Minimum Android,Content Rating,Ad Supported,In App Purchases,Editors Choice,Category
0,100+,True,USD,4.4 and up,Everyone,True,False,False,Shopping
1,100+,True,USD,5.0 and up,Teen,True,False,False,Lifestyle
2,"1,000+",True,USD,4.0.3 and up,Everyone,False,False,False,Communication
3,100+,True,USD,4.4 and up,Teen,False,False,False,Social
4,50+,True,USD,5.0 and up,Everyone,True,True,False,Puzzle
...,...,...,...,...,...,...,...,...,...
2312730,"1,000+",True,USD,Varies with device,Teen,True,False,False,Role Playing
2312731,10+,True,USD,Varies with device,Everyone,False,False,False,Simulation
2312732,100+,True,USD,Varies with device,Everyone,False,False,False,Productivity
2312733,50+,True,USD,Varies with device,Everyone,False,False,False,Health & Fitness


In [8]:
onehot = pd.get_dummies(apps_v2[['Installs', 'Free']])




In [9]:
twohot = pd.get_dummies(apps_v2[['Currency','Minimum Android']])

In [10]:
threehot = pd.get_dummies(apps_v2[['Content Rating','Ad Supported']])

In [11]:
fourhot = pd.get_dummies(apps_v2[['In App Purchases', 'Editors Choice']])

In [12]:
onehot

Unnamed: 0,Installs_0+,Installs_1+,"Installs_1,000+","Installs_1,000,000+","Installs_1,000,000,000+",Installs_10+,"Installs_10,000+","Installs_10,000,000+","Installs_10,000,000,000+",Installs_100+,...,"Installs_5,000,000+","Installs_5,000,000,000+",Installs_50+,"Installs_50,000+","Installs_50,000,000+",Installs_500+,"Installs_500,000+","Installs_500,000,000+",Free_False,Free_True
0,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1
2,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2312730,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2312731,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2312732,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1
2312733,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,1


In [13]:
twohot

Unnamed: 0,Currency_AUD,Currency_BRL,Currency_CAD,Currency_EUR,Currency_GBP,Currency_INR,Currency_KRW,Currency_PKR,Currency_RUB,Currency_SGD,...,Minimum Android_6.0,Minimum Android_6.0 - 7.1.1,Minimum Android_6.0 - 8.0,Minimum Android_6.0 and up,Minimum Android_7.0,Minimum Android_7.0 and up,Minimum Android_7.1 and up,Minimum Android_8.0,Minimum Android_8.0 and up,Minimum Android_Varies with device
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2312730,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2312731,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2312732,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2312733,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [14]:
threehot

Unnamed: 0,Content Rating_Adults only 18+,Content Rating_Everyone,Content Rating_Everyone 10+,Content Rating_Mature 17+,Content Rating_Teen,Content Rating_Unrated,Ad Supported_False,Ad Supported_True
0,0,1,0,0,0,0,0,1
1,0,0,0,0,1,0,0,1
2,0,1,0,0,0,0,1,0
3,0,0,0,0,1,0,1,0
4,0,1,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...
2312730,0,0,0,0,1,0,0,1
2312731,0,1,0,0,0,0,1,0
2312732,0,1,0,0,0,0,1,0
2312733,0,1,0,0,0,0,1,0


In [15]:
fourhot

Unnamed: 0,In App Purchases_False,In App Purchases_True,Editors Choice_False,Editors Choice_True
0,1,0,1,0
1,1,0,1,0
2,1,0,1,0
3,1,0,1,0
4,0,1,1,0
...,...,...,...,...
2312730,1,0,1,0
2312731,1,0,1,0
2312732,1,0,1,0
2312733,1,0,1,0
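
The encoded frames above cover every row of `apps_v2`, while `dataset1` only contains the rows that survived `dropna()`. The following minimal sketch, on a made-up four-row frame (the data and the names `toy`, `numeric`, `encoded`, `features` and `target` are illustrative assumptions, not part of the project data), shows one way to combine such frames column-wise while keeping features and labels on the same rows.

```python
import pandas as pd

# Made-up stand-in for the app data; row 1 has a missing Rating.
toy = pd.DataFrame({
    "Rating": [4.5, None, 3.0, 5.0],
    "Installs": ["100+", "50+", "100+", "1,000+"],
    "Category": ["Shopping", "Puzzle", "Social", "Puzzle"],
})

numeric = toy.dropna()[["Rating"]]           # rows with complete data only
encoded = pd.get_dummies(toy[["Installs"]])  # one-hot columns for all rows

# axis=1 places the frames side by side; join="inner" keeps only the
# index labels that occur in every frame.
features = pd.concat([encoded, numeric], axis=1, join="inner")

# Select the labels through the same index so they stay aligned.
target = toy.loc[features.index, "Category"]

print(features)
print(target)
```

Without `axis=1`, `pd.concat` stacks the frames on top of each other, which yields far more rows than labels and many missing values.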


In [None]:
# Combine the one-hot encoded columns and the numeric columns side by side (axis=1).
# join='inner' keeps only the rows present in every frame, i.e. the rows that survived dropna() in dataset1.
X_concat = pd.concat([onehot, twohot, threehot, fourhot, dataset1], axis=1, join='inner')

# Take the target from the same rows so that features and labels stay aligned.
target_dataset = apps_v2.loc[X_concat.index, 'Category'].copy()

# If the data still needs to be split (further), the best way to do so is
# sklearn.model_selection.train_test_split.
# It returns features and target classes for train and test and can be parametrized extensively.

# Here we hold out 20% of the data (test_size=0.2) as test data and take all target classes
# into account equally (stratification, stratify=target_dataset):

X_train, X_test, y_train, y_test = train_test_split(X_concat, target_dataset, 
                                                    test_size=0.2, 
                                                    random_state=42,
                                                    stratify=target_dataset)
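
To illustrate what stratification does, here is a small sketch on made-up data (the frame `toy` and its three imbalanced categories are assumptions chosen for illustration). It holds out 20% of the rows and compares the class counts in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, three imbalanced categories (60/30/10).
toy = pd.DataFrame({
    "Rating": [i % 5 + 1.0 for i in range(100)],
    "Category": ["Tools"] * 60 + ["Puzzle"] * 30 + ["Social"] * 10,
})
X_toy = toy[["Rating"]]
y_toy = toy["Category"]

# Hold out 20% as test data; stratify keeps the class proportions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Class counts in each part of the split.
train_counts = y_tr.value_counts().to_dict()
test_counts = y_te.value_counts().to_dict()
print(train_counts)
print(test_counts)
```

Note that `stratify` requires at least two samples per class, so very rare categories may need to be merged or removed before splitting.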

In [None]:
# Select features
# With threshold=None, SelectFromModel falls back to a default threshold (1e-5 for an L1-penalized estimator, otherwise the mean)
# feature_selection = SelectFromModel(LinearSVC(penalty="l1", dual=False), threshold='mean')
# TfidfVectorizer is not used because the dataset contains no text documents

In [None]:
# Set the algorithm's parameters
# Define the pipeline again, without fixing the parameters yet

feature_selection=SelectFromModel(LinearSVC(penalty="l1", dual=False))
classifier=SVC()

pipeline = Pipeline([('feature_selection', feature_selection),
                     ('classifier', classifier)])

# Define the parameter space: the key is stepname__parametername, the value lists the candidates to try

parameters = {  
    'feature_selection__threshold': (None, 'mean'), 
    'classifier__kernel': ('linear','rbf')
}

# Search the entire parameter space (cross-validation over the training data)
grid_search = GridSearchCV(pipeline, param_grid=parameters, verbose=10)

grid_search.fit(X_train, y_train)

In [None]:
# Which parameter combination is the best?

print(grid_search.best_estimator_)

# If a parameter is not listed explicitly, its default worked best.
# We therefore choose
# 'feature_selection__threshold': None, 
# 'classifier__kernel': 'rbf'
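
For reporting the parameter space together with the score of each combination, the `cv_results_` attribute of a fitted `GridSearchCV` can be turned into a table. The sketch below uses synthetic data from `make_classification` as a stand-in for the app features; the names `X_demo`, `y_demo`, `demo_pipeline`, `demo_grid` and `search` are illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC, LinearSVC

# Synthetic stand-in data (assumption): 200 samples, 10 features.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=0
)

demo_pipeline = Pipeline([
    ("feature_selection", SelectFromModel(LinearSVC(penalty="l1", dual=False))),
    ("classifier", SVC()),
])

# Same naming scheme as above: stepname__parametername.
demo_grid = {
    "feature_selection__threshold": [None, "mean"],
    "classifier__kernel": ["linear", "rbf"],
}

search = GridSearchCV(demo_pipeline, param_grid=demo_grid, cv=3)
search.fit(X_demo, y_demo)

# One row per parameter combination, best combination first.
results = pd.DataFrame(search.cv_results_)[[
    "param_feature_selection__threshold",
    "param_classifier__kernel",
    "mean_test_score",
    "rank_test_score",
]].sort_values("rank_test_score")

print(results)
print(search.best_params_)
```

Reading `best_params_` avoids having to infer the chosen values from the printed estimator. On a dataset with millions of rows, a kernel `SVC` scales poorly, so running the search on a stratified subsample may be necessary.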

In [None]:
# Define the pipeline for the best parameter combination
final_pipeline = Pipeline(steps=[('feature_selection',
                 SelectFromModel(estimator=LinearSVC(dual=False,
                                                     penalty='l1'))),
                ('classifier', SVC())])

# How good is the finished learner on the training data?
# Evaluation via cross-validation (analogous to the parameter search)
# cross_val_predict records, for each data point, the prediction made while that point belongs to the held-out fold;
# the predictions are therefore made on unseen data and are available for the whole training set.

train_labels = cross_val_predict(final_pipeline, X_train, y_train, cv=10)

# Compute precision/recall/F-score

print(classification_report(y_train, train_labels))
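
The behaviour of `cross_val_predict` described in the comments can be checked on a small synthetic example. The data from `make_classification` and the names `X_demo`, `y_demo` and `oof_labels` are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 120 samples, two classes.
X_demo, y_demo = make_classification(n_samples=120, n_features=6, random_state=0)

# Each sample is predicted by a model that did not see it during fitting.
oof_labels = cross_val_predict(SVC(), X_demo, y_demo, cv=5)

# Exactly one prediction per sample, in the original sample order.
print(oof_labels.shape)
print(classification_report(y_demo, oof_labels))
```

Because every prediction comes from a model fitted without that sample, the resulting report estimates generalization rather than memorization.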

In [None]:
# Step 5: Make predictions on the test data and evaluate them

from sklearn.metrics import classification_report, confusion_matrix

# Now train the learner one last time on all training data and then evaluate it on the test data

# Train the learner on the entire training data
final_pipeline.fit(X_train, y_train)

# Evaluate the learner on the test data

# Using the learner's default score (mean accuracy for SVC)

print("Default score of the classifier: Accuracy=",final_pipeline.score(X_test, y_test), "\n")

# Predict the labels and then compute precision/recall/F-score
test_labels = final_pipeline.predict(X_test)

print(classification_report(y_test, test_labels))
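
`confusion_matrix` is imported above but not used. As a hedged sketch of how it could complement the report, the following example builds a labelled confusion matrix from made-up labels (the lists `y_true_demo` and `y_pred_demo` are invented for illustration and are not results from the project).

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up true and predicted categories (not project results).
y_true_demo = ["Puzzle", "Puzzle", "Social", "Tools", "Tools", "Tools"]
y_pred_demo = ["Puzzle", "Social", "Social", "Tools", "Tools", "Puzzle"]
labels = ["Puzzle", "Social", "Tools"]

# Rows are the true classes, columns the predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)

# Wrap the matrix in a DataFrame so rows and columns carry class names.
cm_table = pd.DataFrame(cm, index=labels, columns=labels)
print(cm_table)
```

With many categories, such a table shows which classes are confused with each other, which the per-class precision and recall values alone do not reveal.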