# 6. Classification
## Requirements for the Project Implementation: Classification

---
**TASK:**

- Carry out a classification task on your data using an algorithm of your choice.
- First split the data to avoid overfitting during training and parameter selection. Explain the chosen strategy and the split proportions.
- Select suitable features and set the algorithm's parameters. Describe the procedure you chose for selecting features and parameters. Report the parameter space and the final parameters chosen. State the performance on the training data (or on the development data, if used).
- Evaluate the classification on the unseen test data. Consider precision and recall as well as the F-score. Which measure is more important for your application? Assess your result. Is it likely to be satisfactory in practice?

In [1]:
# Imports used below

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.preprocessing import LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd

In [2]:
apps = pd.read_csv("Daten/Google-Playstore_Edit2.csv")

In [3]:
apps.head(2)

Unnamed: 0,App Name,App Id,Category,Rating,Rating Count,Installs,Minimum Installs,Maximum Installs,Free,Price,...,Developer Email,Released,Last Updated,Content Rating,Privacy Policy,Ad Supported,In App Purchases,Editors Choice,Scraped Time,Released Year
0,BBD Store - Best Budget Deals,com.app.BBD_Store,Shopping,5.0,20.0,100+,100.0,179,True,0.0,...,bbdstore9137@gmail.com,2020-12-31,2021-05-27,Everyone,https://bbd-store-0.flycricket.io/privacy.html,True,False,False,2021-06-16 00:05:22,2020.0
1,아가펫츠,com.suholdings.agapets,Lifestyle,5.0,8.0,100+,100.0,152,True,0.0,...,info@agapets.com,2020-06-16,2020-08-06,Teen,http://www.agapets.com/terms/terms_privacy.html,True,False,False,2021-06-16 02:17:56,2020.0


In [4]:
# Reorder the DataFrame columns
apps = apps[['App Name', 'App Id', 'Rating', 'Rating Count', 'Installs', 'Minimum Installs', 'Maximum Installs', 'Free', 
             'Price', 'Currency', 'Size', 'Minimum Android', 'Developer Id', 'Developer Website', 'Developer Email', 
             'Released', 'Last Updated', 'Content Rating', 'Privacy Policy', 'Ad Supported', 'In App Purchases', 'Editors Choice', 'Scraped Time', 'Category']]

In [5]:
# Drop all columns with unique values: they would require too much computing capacity
# apps_v2 = apps.copy()
apps.drop(columns=['App Name', 'App Id', 'Developer Id', 'Developer Website','Minimum Android', 'Developer Email', 'Privacy Policy', 'Released', 'Scraped Time', 'Last Updated'], inplace=True)

In [6]:
apps['Free']             = apps['Free'].astype(float)
apps['Ad Supported']     = apps['Ad Supported'].astype(float)
apps['Editors Choice']   = apps['Editors Choice'].astype(float)
apps['In App Purchases'] = apps['In App Purchases'].astype(float)

In [7]:
# Prepare the data and split it into test and training data
apps = apps.dropna()
apps.select_dtypes(include=['object'])

# dataset1 = apps.dropna().loc[:,['Rating','Rating Count', 'Size', 'Minimum Installs', 'Maximum Installs', 'Price']] 
# dataset2 = apps.dropna().loc[:, 'Rating':'Editors Choice'] 
# list = dataset2.select_dtypes(include=["object"])
# X_onehot = pd.get_dummies(list)

Unnamed: 0,Installs,Currency,Content Rating,Category
0,100+,USD,Everyone,Shopping
1,100+,USD,Teen,Lifestyle
2,"1,000+",USD,Everyone,Communication
3,100+,USD,Teen,Social
4,50+,USD,Everyone,Puzzle
...,...,...,...,...
2290043,10+,USD,Everyone,Health & Fitness
2290044,10+,USD,Everyone,Finance
2290045,5+,USD,Everyone,Auto & Vehicles
2290046,500+,USD,Everyone,Productivity


In [8]:
apps.dtypes

Rating              float64
Rating Count        float64
Installs             object
Minimum Installs    float64
Maximum Installs      int64
Free                float64
Price               float64
Currency             object
Size                float64
Content Rating       object
Ad Supported        float64
In App Purchases    float64
Editors Choice      float64
Category             object
dtype: object

In [9]:
numerical_cols = list(apps.select_dtypes(include="float").columns)
categorical_cols = list(apps.select_dtypes(include="object").columns)
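
Note that `include="float"` leaves out integer columns: in the dtype listing above, `Maximum Installs` is `int64` and therefore does not end up in `numerical_cols`. The following is a minimal sketch on a toy frame (the values are made up for illustration) that contrasts this with `include="number"`:

```python
import pandas as pd

# Toy frame mirroring some of the column types above (values are illustrative)
toy = pd.DataFrame({
    "Rating": [4.5, 3.0],                    # float column
    "Maximum Installs": [179, 152],          # integer column
    "Category": ["Shopping", "Lifestyle"],   # string column
})

float_cols = list(toy.select_dtypes(include="float").columns)
number_cols = list(toy.select_dtypes(include="number").columns)

print(float_cols)   # only the float column
print(number_cols)  # float and integer columns
```

Whether the integer column should be part of the feature matrix is a modelling decision; the sketch only shows how the two selectors differ.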

In [10]:
categorical_cols.remove("Category")
categorical_cols

['Installs', 'Currency', 'Content Rating']

In [11]:
X_dumm = pd.get_dummies(apps[categorical_cols])

In [12]:
X_dumm.head(2)

Unnamed: 0,Installs_0+,Installs_1+,"Installs_1,000+","Installs_1,000,000+","Installs_1,000,000,000+",Installs_10+,"Installs_10,000+","Installs_10,000,000+","Installs_10,000,000,000+",Installs_100+,...,Currency_USD,Currency_VND,Currency_XXX,Currency_ZAR,Content Rating_Adults only 18+,Content Rating_Everyone,Content Rating_Everyone 10+,Content Rating_Mature 17+,Content Rating_Teen,Content Rating_Unrated
0,0,0,0,0,0,0,0,0,0,1,...,1,0,0,0,0,1,0,0,0,0
1,0,0,0,0,0,0,0,0,0,1,...,1,0,0,0,0,0,0,0,1,0


In [13]:
X = pd.concat([apps[numerical_cols], X_dumm], axis = 1)

In [14]:
X.head()

Unnamed: 0,Rating,Rating Count,Minimum Installs,Free,Price,Size,Ad Supported,In App Purchases,Editors Choice,Installs_0+,...,Currency_USD,Currency_VND,Currency_XXX,Currency_ZAR,Content Rating_Adults only 18+,Content Rating_Everyone,Content Rating_Everyone 10+,Content Rating_Mature 17+,Content Rating_Teen,Content Rating_Unrated
0,5.0,20.0,100.0,1.0,0.0,6.7,1.0,0.0,0.0,0,...,1,0,0,0,0,1,0,0,0,0
1,5.0,8.0,100.0,1.0,0.0,71.0,1.0,0.0,0.0,0,...,1,0,0,0,0,0,0,0,1,0
2,5.0,8.0,1000.0,1.0,0.0,2.4,0.0,0.0,0.0,0,...,1,0,0,0,0,1,0,0,0,0
3,5.0,22.0,100.0,1.0,0.0,50.0,0.0,0.0,0.0,0,...,1,0,0,0,0,0,0,0,1,0
4,5.0,13.0,50.0,1.0,0.0,18.0,1.0,1.0,0.0,0,...,1,0,0,0,0,1,0,0,0,0


In [15]:
y = apps['Category'] #.copy()
print(f"X and y have the same number of rows: {X.shape[0] == y.shape[0]}")
# If the data still has to be split (further), the easiest way is
# sklearn.model_selection.train_test_split.
# It returns features and target classes for test and train and is highly configurable.

# We can therefore also create development data comprising 20% (test_size=0.2) of the training data,
# with all target classes represented proportionally (stratified, stratify=target_train):

# X_train, X_test, y_train, y_test = train_test_split(X_concat, target_dataset, test_size=0.2, random_state=42, stratify=target_dataset)

X and y have the same number of rows: True


In [16]:
label_encoder = LabelEncoder()

In [17]:
y = label_encoder.fit_transform(y) # maps the category names to integers 0, 1, 2, 3, ...
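
`LabelEncoder` assigns the integer codes in sorted order of the class names, and `inverse_transform` maps them back. This helps when reading the numeric rows of a classification report. A small sketch with made-up labels:

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels, not taken from the dataset
categories = ["Shopping", "Lifestyle", "Communication", "Shopping"]

encoder = LabelEncoder()
codes = encoder.fit_transform(categories)
decoded = list(encoder.inverse_transform(codes))

print(list(encoder.classes_))  # class names in sorted order
print(list(codes))             # integer code per sample
print(decoded)                 # back to the original names
```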

In [18]:
X_train, X_test1, y_train, y_test1 = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Split the held-out data into development and final test sets (50/50, stratified)
X_dev, X_test, y_dev, y_test = train_test_split(X_test1, y_test1, test_size=0.5, stratify=y_test1, random_state=42)
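
Taken together, the two calls above produce an 80/10/10 split into training, development and test data. A minimal sketch on synthetic data (array size and class labels are assumptions for illustration) that checks the resulting proportions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1,000 samples, 4 balanced classes
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.repeat([0, 1, 2, 3], 250)

# 80% training data, 20% held out
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# Held-out part split 50/50 into development and test data
X_dv, X_te, y_dv, y_te = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=42, stratify=y_hold)

print(len(X_tr), len(X_dv), len(X_te))
print(np.bincount(y_dv))  # class counts in the development split
```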

In [19]:
# Select features
# With threshold=None a fixed default threshold is used
# feature_selection = SelectFromModel(LinearSVC(penalty="l1", dual=False), threshold='mean')
# TfidfVectorizer is not used because the dataset contains no text documents
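
To illustrate what the two threshold settings do: an L1-penalised `LinearSVC` pushes some coefficients to zero, and `SelectFromModel` keeps the features whose coefficient magnitude reaches the threshold. The sketch below uses synthetic data from `make_classification`; the data, `C=0.1` and `max_iter=5000` are assumptions chosen for the illustration, not settings from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

# Synthetic data: 20 features, of which only 5 are informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_redundant=0,
    random_state=0)

kept = {}
for threshold in (None, "mean"):
    selector = SelectFromModel(
        LinearSVC(penalty="l1", dual=False, C=0.1, max_iter=5000),
        threshold=threshold)
    X_sel = selector.fit_transform(X_demo, y_demo)
    kept[str(threshold)] = X_sel.shape[1]
    print(threshold, "->", X_sel.shape[1], "features kept")
```

With `threshold='mean'` only features with above-average coefficient magnitude survive, so this setting is typically the stricter of the two.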

In [20]:
# Set the parameters of the algorithm
# Redefine the pipeline without fixing the parameters

feature_selection = SelectFromModel(LinearSVC(penalty="l1", dual=False))
classifier = SVC()

pipeline = Pipeline([('feature_selection', feature_selection), ('classifier', classifier)])

# Define the parameter space: the key is stepname__parametername, the value lists the settings to try

parameters = {
    'feature_selection__threshold': (None, 'mean'),
    'classifier__kernel': ('linear','rbf'),
}

# Alternative: do everything manually instead of using grid search,
# i.e. run feature selection twice (None and mean) and evaluate each on test and training data --> 4 results in total,
# and run the classifier twice (linear and rbf) and evaluate each on test and training data --> 4 results in total

# Search the whole parameter space (cross-validation on the training data)
# Note: only the first 50 training rows are used here
grid_search = GridSearchCV(pipeline, param_grid=parameters, verbose=10)

grid_search.fit(X_train[:50], y_train[:50])

Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV 1/5; 1/4] START classifier__kernel=linear, feature_selection__threshold=None




[CV 1/5; 1/4] END classifier__kernel=linear, feature_selection__threshold=None;, score=0.000 total time= 2.4min
[CV 2/5; 1/4] START classifier__kernel=linear, feature_selection__threshold=None




[CV 2/5; 1/4] END classifier__kernel=linear, feature_selection__threshold=None;, score=0.100 total time=   1.9s
[CV 3/5; 1/4] START classifier__kernel=linear, feature_selection__threshold=None




[CV 3/5; 1/4] END classifier__kernel=linear, feature_selection__threshold=None;, score=0.100 total time=  39.0s
[CV 4/5; 1/4] START classifier__kernel=linear, feature_selection__threshold=None




[CV 4/5; 1/4] END classifier__kernel=linear, feature_selection__threshold=None;, score=0.100 total time=   7.9s
[CV 5/5; 1/4] START classifier__kernel=linear, feature_selection__threshold=None




[CV 5/5; 1/4] END classifier__kernel=linear, feature_selection__threshold=None;, score=0.000 total time=  39.5s
[CV 1/5; 2/4] START classifier__kernel=linear, feature_selection__threshold=mean
[CV 1/5; 2/4] END classifier__kernel=linear, feature_selection__threshold=mean;, score=0.100 total time=   0.0s
[CV 2/5; 2/4] START classifier__kernel=linear, feature_selection__threshold=mean
[CV 2/5; 2/4] END classifier__kernel=linear, feature_selection__threshold=mean;, score=0.000 total time=   0.0s
[CV 3/5; 2/4] START classifier__kernel=linear, feature_selection__threshold=mean
[CV 3/5; 2/4] END classifier__kernel=linear, feature_selection__threshold=mean;, score=0.100 total time=   0.0s
[CV 4/5; 2/4] START classifier__kernel=linear, feature_selection__threshold=mean




[CV 4/5; 2/4] END classifier__kernel=linear, feature_selection__threshold=mean;, score=0.000 total time=   0.6s
[CV 5/5; 2/4] START classifier__kernel=linear, feature_selection__threshold=mean
[CV 5/5; 2/4] END classifier__kernel=linear, feature_selection__threshold=mean;, score=0.100 total time=   0.0s
[CV 1/5; 3/4] START classifier__kernel=rbf, feature_selection__threshold=None...
[CV 1/5; 3/4] END classifier__kernel=rbf, feature_selection__threshold=None;, score=0.100 total time=   0.0s
[CV 2/5; 3/4] START classifier__kernel=rbf, feature_selection__threshold=None...
[CV 2/5; 3/4] END classifier__kernel=rbf, feature_selection__threshold=None;, score=0.100 total time=   0.0s
[CV 3/5; 3/4] START classifier__kernel=rbf, feature_selection__threshold=None...
[CV 3/5; 3/4] END classifier__kernel=rbf, feature_selection__threshold=None;, score=0.100 total time=   0.0s
[CV 4/5; 3/4] START classifier__kernel=rbf, feature_selection__threshold=None...
[CV 4/5; 3/4] END classifier__kernel=rbf, fe



[CV 4/5; 4/4] END classifier__kernel=rbf, feature_selection__threshold=mean;, score=0.100 total time=   0.0s
[CV 5/5; 4/4] START classifier__kernel=rbf, feature_selection__threshold=mean...
[CV 5/5; 4/4] END classifier__kernel=rbf, feature_selection__threshold=mean;, score=0.100 total time=   0.0s




GridSearchCV(estimator=Pipeline(steps=[('feature_selection',
                                        SelectFromModel(estimator=LinearSVC(dual=False,
                                                                            penalty='l1'))),
                                       ('classifier', SVC())]),
             param_grid={'classifier__kernel': ('linear', 'rbf'),
                         'feature_selection__threshold': (None, 'mean')},
             verbose=10)

In [21]:
# Which parameter combination is the best?

print(grid_search.best_estimator_)

# If a parameter is not shown explicitly, its default worked best.
# We therefore choose
# 'feature_selection__threshold': None,
# 'classifier__kernel': 'rbf'

Pipeline(steps=[('feature_selection',
                 SelectFromModel(estimator=LinearSVC(dual=False,
                                                     penalty='l1'))),
                ('classifier', SVC())])
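
Because the printed estimator only shows parameters that differ from their defaults, the chosen combination has to be inferred. `best_params_` and `cv_results_` report it directly, together with the score of every candidate. A minimal sketch, assuming the iris dataset as a stand-in for the app data and `max_iter=5000` as an illustrative setting:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC, LinearSVC

# Stand-in data (assumption): iris instead of the Play Store features
X_demo, y_demo = load_iris(return_X_y=True)

demo_pipeline = Pipeline([
    ("feature_selection",
     SelectFromModel(LinearSVC(penalty="l1", dual=False, max_iter=5000))),
    ("classifier", SVC()),
])
demo_parameters = {
    "feature_selection__threshold": [None, "mean"],
    "classifier__kernel": ["linear", "rbf"],
}

search = GridSearchCV(demo_pipeline, param_grid=demo_parameters, cv=5)
search.fit(X_demo, y_demo)

# Selected combination and per-candidate cross-validation scores
print(search.best_params_)
results = pd.DataFrame(search.cv_results_)
print(results[["params", "mean_test_score", "rank_test_score"]])
```

The `results` table is also a convenient way to report the full parameter space alongside the final choice.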


In [22]:
# Define the pipeline for the best parameter combination
final_pipeline = Pipeline(steps=[('feature_selection',
                 SelectFromModel(estimator=LinearSVC(dual=False,
                                                     penalty='l1'))),
                ('classifier', SVC())])

# How good is the final learner on the training data?
# Evaluation via cross-validation (analogous to the parameter search)
# cross_val_predict stores, for each data point, the prediction made while that point was in the
# held-out fold; the predictions are therefore out-of-sample and available for the whole dataset.

train_labels = cross_val_predict(final_pipeline, X_train[:2000], y_train[:2000], cv=5)

# Compute precision/recall/F-score

print(classification_report(y_train[:2000], train_labels))



              precision    recall  f1-score   support

           0       0.00      0.00      0.00        24
           1       0.00      0.00      0.00        18
           2       0.00      0.00      0.00        53
           3       0.00      0.00      0.00        14
           4       0.00      0.00      0.00        18
           5       0.00      0.00      0.00         7
           6       0.00      0.00      0.00        10
           7       0.00      0.00      0.00       103
           8       0.00      0.00      0.00       122
           9       0.00      0.00      0.00         9
          10       0.00      0.00      0.00         3
          11       0.00      0.00      0.00        48
          12       0.00      0.00      0.00         1
          13       0.00      0.00      0.00        38
          14       0.00      0.00      0.00         6
          15       0.11      1.00      0.19       213
          16       0.00      0.00      0.00        17
          17       0.00    

  _warn_prf(average, modifier, msg_start, len(result))


In [23]:
# Step 5: Predict on held-out data and evaluate

from sklearn.metrics import classification_report, confusion_matrix

# Train the learner one last time and then evaluate it on held-out data

# Fit the learner on the same 2,000-row training subset used above
final_pipeline.fit(X_train[:2000], y_train[:2000])

# Evaluate the learner

# With the learner's default score (mean accuracy for SVC), computed on the test split:

print("Default score of the classifier: Accuracy=", final_pipeline.score(X_test, y_test), "\n")

# Predict labels and then compute precision/recall/F-score
# Note: the report below is computed on the development split (X_dev), not on X_test
test_labels = final_pipeline.predict(X_dev)

print(classification_report(y_dev, test_labels))



Default score of the classifier: Accuracy= 0.10536543189390532 



  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

           0       0.00      0.00      0.00      2682
           1       0.00      0.00      0.00      2279
           2       0.00      0.00      0.00      5271
           3       0.00      0.00      0.00      1839
           4       0.00      0.00      0.00      1795
           5       0.00      0.00      0.00      1170
           6       0.00      0.00      0.00      1045
           7       0.00      0.00      0.00     11649
           8       0.00      0.00      0.00     14215
           9       0.00      0.00      0.00       804
          10       0.00      0.00      0.00       504
          11       0.05      0.01      0.01      4925
          12       0.00      0.00      0.00       282
          13       0.00      0.00      0.00      4754
          14       0.00      0.00      0.00       645
          15       0.11      0.99      0.19     23979
          16       0.00      0.00      0.00      2106
          17       0.07    

  _warn_prf(average, modifier, msg_start, len(result))


**Notes:**

- 