# 6. Classification - Modeling
#### Supervised classification method
## Project implementation requirements: Classification

---
**TASK:**

- Perform a classification task on your data using an algorithm of your choice.
- First split the data to avoid overfitting when training the algorithm and when selecting parameters. Explain the chosen strategy and the split proportions.
- Select suitable features and set the parameters of the algorithm. Describe the procedure chosen for selecting features and parameters. Report the parameter space and the final chosen parameters. State the performance on the training data (or development data, if used).
- Evaluate the classification on the unseen test data. Consider precision and recall as well as the F-score. Which measure is more important for your application? Assess your result. Is it likely to be satisfactory in practice?

In [1]:
# Imports used below

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.preprocessing import LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

In [2]:
apps = pd.read_csv("Daten/Google-Playstore_Edit2.csv")

In [3]:
# Remove all categories that are irrelevant to us as a company and ruled out for development
excluded_categories = ['Action', 'Arcade', 'Beauty', 'Casino', 'Comics', 'Dating', 'Educational',
                       'Puzzle', 'Racing', 'Role Playing', 'Shopping', 'Trivia', 'Video Players & Editors']
# Select the rows to drop with a single boolean mask instead of adding DataFrames
less_apps = apps[apps['Category'].isin(excluded_categories)]

apps.drop(less_apps.index, axis=0, inplace=True)

In [4]:
cat = apps.groupby('Category')
cat['Category'].count()

Category
Adventure             23196
Art & Design          18538
Auto & Vehicles       18278
Board                 10588
Books & Reference    116726
Business             143761
Card                   8179
Casual                50797
Communication         48159
Education            241075
Entertainment        138268
Events                12839
Finance               65456
Food & Drink          73920
Health & Fitness      83501
House & Home          14369
Libraries & Demo       5196
Lifestyle            118324
Maps & Navigation     26722
Medical               32063
Music                  4202
Music & Audio        154898
News & Magazines      42804
Parenting              3810
Personalization       89210
Photography           35552
Productivity          79686
Simulation            23276
Social                44729
Sports                47478
Strategy               8525
Tools                143976
Travel & Local        67282
Weather                7245
Word                   8630
Name: Categ

In [5]:
# Randomly reduce the dataset to half its size and create a new DataFrame
half_apps = apps.sample(frac = 0.5)
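
Without a seed, the sample above contains different rows on every run, so the numbers further down cannot be reproduced exactly. A minimal sketch of a repeatable 50 % sample follows; the toy frame `toy_apps` and the seed `42` are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for `apps`
toy_apps = pd.DataFrame({
    "Category": ["Tools", "Music", "Tools", "Social", "Music", "Tools"],
    "Rating": [3.5, 4.1, 4.9, 0.0, 2.2, 4.0],
})

# A fixed random_state makes the 50 % sample repeatable across runs
half_a = toy_apps.sample(frac=0.5, random_state=42)
half_b = toy_apps.sample(frac=0.5, random_state=42)

print(len(half_a))                        # number of sampled rows
print(half_a.index.equals(half_b.index))  # whether both draws picked the same rows
```

Whether a fixed seed is desirable depends on the goal; it mainly helps when results have to be compared across notebook runs.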

In [6]:
# Drop all columns with unique values, since they would require too much computing capacity
# apps_v2 = apps.copy()
half_apps.drop(columns=['App Name', 'App Id', 'Developer Id', 'Developer Website','Minimum Android', 'Developer Email', 'Privacy Policy', 'Released', 'Scraped Time', 'Last Updated'], inplace=True)

In [7]:
# Convert to float values
half_apps['Free']             = half_apps['Free'].astype(float)
half_apps['Ad Supported']     = half_apps['Ad Supported'].astype(float)
half_apps['Editors Choice']   = half_apps['Editors Choice'].astype(float)
half_apps['In App Purchases'] = half_apps['In App Purchases'].astype(float)
half_apps['Maximum Installs'] = half_apps['Maximum Installs'].astype(float)

In [8]:
half_apps.dtypes

Category             object
Rating              float64
Rating Count        float64
Installs             object
Minimum Installs    float64
Maximum Installs    float64
Free                float64
Price               float64
Currency             object
Size                float64
Content Rating       object
Ad Supported        float64
In App Purchases    float64
Editors Choice      float64
Released Year       float64
dtype: object

In [9]:
# Drop rows with missing values, then select the columns of dtype object
half_apps = half_apps.dropna()
half_apps.select_dtypes(include=['object'])

| | Category | Installs | Currency | Content Rating |
|---|---|---|---|---|
| 1032616 | Tools | 10,000+ | USD | Everyone |
| 737983 | Entertainment | 5,000+ | USD | Everyone |
| 132989 | Books & Reference | 10,000+ | USD | Everyone |
| 2277604 | Lifestyle | 5+ | USD | Everyone |
| 1457434 | Medical | 100+ | USD | Everyone |
| ... | ... | ... | ... | ... |
| 1058296 | Business | 10,000+ | USD | Everyone |
| 1241051 | Tools | 10+ | USD | Everyone |
| 1306746 | Personalization | 1+ | USD | Everyone |
| 1270299 | Entertainment | 100+ | USD | Mature 17+ |


In [10]:
# Split the column names into a list of numeric columns and a list of still-categorical columns
numerical_cols = list(half_apps.select_dtypes(include="float").columns)
categorical_cols = list(half_apps.select_dtypes(include="object").columns)

In [11]:
# Remove Category, because it will be used as the target class
categorical_cols.remove("Category")
categorical_cols

['Installs', 'Currency', 'Content Rating']

In [12]:
# Because the classifiers used here only work with numeric data,
# one-hot encoding turns the categorical columns into numeric columns
X_dumm = pd.get_dummies(half_apps[categorical_cols])

In [13]:
X_dumm.head(2)

| | Installs_0+ | Installs_1+ | Installs_1,000+ | Installs_1,000,000+ | Installs_1,000,000,000+ | Installs_10+ | Installs_10,000+ | Installs_10,000,000+ | Installs_10,000,000,000+ | Installs_100+ | ... | Currency_INR | Currency_USD | Currency_VND | Currency_XXX | Content Rating_Adults only 18+ | Content Rating_Everyone | Content Rating_Everyone 10+ | Content Rating_Mature 17+ | Content Rating_Teen | Content Rating_Unrated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1032616 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 737983 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
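
To make the column naming in the output above easier to follow, here is a small sketch of what `pd.get_dummies` does with two categorical columns. The frame `toy` and its values are made up for illustration; `dtype=int` is only set so that the indicators print as 0/1.

```python
import pandas as pd

# Hypothetical miniature version of the categorical columns
toy = pd.DataFrame({
    "Installs": ["10+", "100+", "10+"],
    "Content Rating": ["Everyone", "Teen", "Everyone"],
})

# Each distinct value becomes its own indicator column named <column>_<value>
encoded = pd.get_dummies(toy, dtype=int)

print(encoded.columns.tolist())
print(encoded.iloc[0].tolist())  # indicator values for the first row
```

One consequence of this encoding is that the natural order of the `Installs` buckets is lost, because every bucket becomes an unrelated 0/1 column.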


In [14]:
# Concatenate the numeric columns and the one-hot encoded columns
X = pd.concat([half_apps[numerical_cols], X_dumm], axis = 1)

In [15]:
X.head()

| | Rating | Rating Count | Minimum Installs | Maximum Installs | Free | Price | Size | Ad Supported | In App Purchases | Editors Choice | ... | Currency_INR | Currency_USD | Currency_VND | Currency_XXX | Content Rating_Adults only 18+ | Content Rating_Everyone | Content Rating_Everyone 10+ | Content Rating_Mature 17+ | Content Rating_Teen | Content Rating_Unrated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1032616 | 3.5 | 26.0 | 10000.0 | 13230.0 | 1.0 | 0.0 | 0.392 | 1.0 | 0.0 | 0.0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 737983 | 4.1 | 53.0 | 5000.0 | 5571.0 | 1.0 | 0.0 | 10.0 | 1.0 | 0.0 | 0.0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 132989 | 4.9 | 240.0 | 10000.0 | 34276.0 | 1.0 | 0.0 | 0.522 | 0.0 | 0.0 | 0.0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2277604 | 0.0 | 0.0 | 5.0 | 5.0 | 1.0 | 0.0 | 21.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1457434 | 0.0 | 0.0 | 100.0 | 297.0 | 1.0 | 0.0 | 15.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |


In [16]:
y = half_apps['Category'] #.copy()
print(f"X and y have the same number of rows: {X.shape[0] == y.shape[0]}")

X and y have the same number of rows: True


In [17]:
label_encoder = LabelEncoder()

In [18]:
y = label_encoder.fit_transform(y) # maps each category to an integer code 0, 1, 2, 3, ...
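
The classification reports further down only show these integer codes. As a minimal sketch with made-up category values, the following shows how `LabelEncoder` assigns codes in sorted order and how they can be mapped back to names. If all 35 categories from the count table above survive the `dropna()` step (an assumption that is not verified here), code 9 would correspond to Education.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical category values for illustration
enc = LabelEncoder()
codes = enc.fit_transform(["Tools", "Education", "Tools", "Music"])

print(codes.tolist())                          # integer code per input value
print(enc.classes_.tolist())                   # position in this list == integer code
print(enc.inverse_transform([0, 2]).tolist())  # map codes back to category names
```

Passing `target_names=label_encoder.classes_` to `classification_report` would print category names instead of codes, provided every class occurs in the evaluated labels.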

In [19]:
# Split the data into a training set and a held-out set (80/20, stratified)
X_train, X_test1, y_train, y_test1 = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Split the held-out data into a development set and the final test set (50/50, stratified)
X_dev, X_test, y_dev, y_test = train_test_split(X_test1, y_test1, test_size=0.5, stratify=y_test1, random_state=42)
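
The two calls above yield an 80/10/10 split into training, development and test data. The sketch below reproduces the same two-step split on synthetic data to check the resulting sizes and class proportions; the array names and the class distribution are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1,000 rows with 4 imbalanced classes (illustrative assumption)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.repeat([0, 1, 2, 3], [500, 300, 150, 50])

# Step 1: hold out 20 % of the data
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)
# Step 2: split the held-out part 50/50 into development and test sets
X_dv, X_te, y_dv, y_te = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=42, stratify=y_hold)

print(len(X_tr), len(X_dv), len(X_te))  # 800 100 100
print(np.bincount(y_tr) / len(y_tr))    # class shares in the training part
```

Stratifying both steps keeps the class shares of the full dataset in every part, which matters here because the categories are strongly imbalanced.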

- Run a grid search
- Also run manual predictions

In [21]:
# Set the parameters of the algorithm
# Define the pipeline without fixing the parameters that are searched below

feature_selection=SelectFromModel(LinearSVC(penalty="l1", dual=False))
classifier = RandomForestClassifier(n_estimators=20, random_state=0)

pipeline = Pipeline([('feature_selection', feature_selection), ('classifier', classifier)])

# Define the parameter space: the key is stepname__parametername, the value holds the candidates to try

parameters = {  
    'feature_selection__threshold': (None, 'mean'), 
    'classifier__criterion': ('gini','entropy')
}

# Search the entire parameter space (cross-validation on the training data)
grid_search = GridSearchCV(pipeline, param_grid=parameters, verbose=10)

grid_search.fit(X_train[:50000], y_train[:50000])

Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV 1/5; 1/4] START classifier__criterion=gini, feature_selection__threshold=None
[CV 1/5; 1/4] END classifier__criterion=gini, feature_selection__threshold=None;, score=0.152 total time= 1.2min
[CV 2/5; 1/4] START classifier__criterion=gini, feature_selection__threshold=None
[CV 2/5; 1/4] END classifier__criterion=gini, feature_selection__threshold=None;, score=0.146 total time= 1.0min
[CV 3/5; 1/4] START classifier__criterion=gini, feature_selection__threshold=None
[CV 3/5; 1/4] END classifier__criterion=gini, feature_selection__threshold=None;, score=0.146 total time= 1.2min
[CV 4/5; 1/4] START classifier__criterion=gini, feature_selection__threshold=None
[CV 4/5; 1/4] END classifier__criterion=gini, feature_selection__threshold=None;, score=0.153 total time= 1.2min
[CV 5/5; 1/4] START classifier__criterion=gini, feature_selection__threshold=None
[CV 5/5; 1/4] END classifier__criterion=gini, feature_selection__threshold=None;, score=0.160 total time= 1.1min
[CV 1/5; 2/4] START classifier__criterion=gini, feature_selection__threshold=mean
[CV 1/5; 2/4] END classifier__criterion=gini, feature_selection__threshold=mean;, score=0.148 total time= 1.1min
[CV 2/5; 2/4] START classifier__criterion=gini, feature_selection__threshold=mean
[CV 2/5; 2/4] END classifier__criterion=gini, feature_selection__threshold=mean;, score=0.156 total time= 1.1min
[CV 3/5; 2/4] START classifier__criterion=gini, feature_selection__threshold=mean
[CV 3/5; 2/4] END classifier__criterion=gini, feature_selection__threshold=mean;, score=0.151 total time= 1.1min
[CV 4/5; 2/4] START classifier__criterion=gini, feature_selection__threshold=mean
[CV 4/5; 2/4] END classifier__criterion=gini, feature_selection__threshold=mean;, score=0.152 total time= 1.1min
[CV 5/5; 2/4] START classifier__criterion=gini, feature_selection__threshold=mean
[CV 5/5; 2/4] END classifier__criterion=gini, feature_selection__threshold=mean;, score=0.143 total time= 1.1min
[CV 1/5; 3/4] START classifier__criterion=entropy, feature_selection__threshold=None
[CV 1/5; 3/4] END classifier__criterion=entropy, feature_selection__threshold=None;, score=0.148 total time= 1.1min
[CV 2/5; 3/4] START classifier__criterion=entropy, feature_selection__threshold=None
[CV 2/5; 3/4] END classifier__criterion=entropy, feature_selection__threshold=None;, score=0.145 total time= 1.2min
[CV 3/5; 3/4] START classifier__criterion=entropy, feature_selection__threshold=None
[CV 3/5; 3/4] END classifier__criterion=entropy, feature_selection__threshold=None;, score=0.147 total time= 1.1min
[CV 4/5; 3/4] START classifier__criterion=entropy, feature_selection__threshold=None
[CV 4/5; 3/4] END classifier__criterion=entropy, feature_selection__threshold=None;, score=0.156 total time= 1.2min
[CV 5/5; 3/4] START classifier__criterion=entropy, feature_selection__threshold=None
[CV 5/5; 3/4] END classifier__criterion=entropy, feature_selection__threshold=None;, score=0.151 total time= 1.1min
[CV 1/5; 4/4] START classifier__criterion=entropy, feature_selection__threshold=mean
[CV 1/5; 4/4] END classifier__criterion=entropy, feature_selection__threshold=mean;, score=0.151 total time= 1.1min
[CV 2/5; 4/4] START classifier__criterion=entropy, feature_selection__threshold=mean
[CV 2/5; 4/4] END classifier__criterion=entropy, feature_selection__threshold=mean;, score=0.158 total time= 1.1min
[CV 3/5; 4/4] START classifier__criterion=entropy, feature_selection__threshold=mean
[CV 3/5; 4/4] END classifier__criterion=entropy, feature_selection__threshold=mean;, score=0.143 total time= 1.1min
[CV 4/5; 4/4] START classifier__criterion=entropy, feature_selection__threshold=mean
[CV 4/5; 4/4] END classifier__criterion=entropy, feature_selection__threshold=mean;, score=0.149 total time= 1.2min
[CV 5/5; 4/4] START classifier__criterion=entropy, feature_selection__threshold=mean
[CV 5/5; 4/4] END classifier__criterion=entropy, feature_selection__threshold=mean;, score=0.143 total time= 1.1min

GridSearchCV(estimator=Pipeline(steps=[('feature_selection',
                                        SelectFromModel(estimator=LinearSVC(dual=False,
                                                                            penalty='l1'))),
                                       ('classifier',
                                        RandomForestClassifier(n_estimators=20,
                                                               random_state=0))]),
             param_grid={'classifier__criterion': ('gini', 'entropy'),
                         'feature_selection__threshold': (None, 'mean')},
             verbose=10)

In [22]:
# Which parameter combination is the best?
print(grid_search.best_estimator_)

Pipeline(steps=[('feature_selection',
                 SelectFromModel(estimator=LinearSVC(dual=False,
                                                     penalty='l1'))),
                ('classifier',
                 RandomForestClassifier(n_estimators=20, random_state=0))])
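
The printed estimator only lists parameters that differ from their defaults, so the winning values for `threshold` and `criterion` do not appear in it (which implies the defaults `threshold=None` and `criterion='gini'`). A small sketch on synthetic data shows how `best_params_` and `cv_results_` state the selected combination and the mean score per candidate explicitly. The dataset from `make_classification`, the variable names and `cv=3` are assumptions chosen to keep the example fast.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Small synthetic 3-class problem (illustrative stand-in for the app data)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0)

pipe = Pipeline([
    ("feature_selection", SelectFromModel(LinearSVC(penalty="l1", dual=False))),
    ("classifier", RandomForestClassifier(n_estimators=20, random_state=0)),
])
grid = {
    "feature_selection__threshold": [None, "mean"],
    "classifier__criterion": ["gini", "entropy"],
}

search = GridSearchCV(pipe, param_grid=grid, cv=3)
search.fit(X_toy, y_toy)

# best_params_ names the winning value for every searched parameter
print(search.best_params_)

# cv_results_ holds one row per candidate with its mean cross-validation score
results = pd.DataFrame(search.cv_results_)
print(results[["params", "mean_test_score", "rank_test_score"]])
```

Reading the mean scores side by side also shows how close the candidates are; in the log above the fold scores of all four combinations lie between 0.143 and 0.160.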


In [24]:
# Define the pipeline for the final evaluation on the training data
# Note: threshold='mean' is set explicitly here, whereas the best_estimator_ printed above uses the default threshold (None)
final_pipeline = Pipeline(steps=[('feature_selection',
                 SelectFromModel(estimator=LinearSVC(dual=False, penalty='l1'),
                                 threshold='mean')),
                ('classifier',
                 RandomForestClassifier(n_estimators=20, random_state=0))])

train_labels = cross_val_predict(final_pipeline, X_train[:50000] , y_train[:50000], cv=10)

# Compute precision, recall and F-score

print(classification_report(y_train[:50000], train_labels[:50000]))



              precision    recall  f1-score   support

           0       0.25      0.02      0.04       582
           1       0.00      0.00      0.00       477
           2       0.00      0.00      0.00       415
           3       0.25      0.00      0.01       283
           4       0.20      0.00      0.01      2977
           5       0.12      0.02      0.04      3637
           6       0.00      0.00      0.00       200
           7       0.00      0.00      0.00      1262
           8       0.00      0.00      0.00      1170
           9       0.13      0.93      0.23      6089
          10       0.07      0.00      0.00      3457
          11       0.00      0.00      0.00       328
          12       0.25      0.00      0.00      1618
          13       0.00      0.00      0.00      1829
          14       0.00      0.00      0.00      2073
          15       0.00      0.00      0.00       358
          16       0.00      0.00      0.00       139
          17       0.50    

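
Many classes in the report above have a precision of 0.00. When a class is never predicted at all, its precision is undefined and scikit-learn falls back to 0 with a warning. The following sketch, using made-up labels, shows the `zero_division` parameter that makes this fallback explicit, together with the macro-averaged F1 score, which weights every class equally.

```python
from sklearn.metrics import classification_report, f1_score

# Hypothetical labels: class 2 is never predicted, so its precision is undefined
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 0, 0, 1]

# zero_division=0 reports 0.0 for undefined scores without emitting a warning
print(classification_report(y_true, y_pred, zero_division=0))

# Macro averaging gives every class the same weight, regardless of its support
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
print(round(macro_f1, 3))
```

For a strongly imbalanced target such as the app categories, the macro average is a stricter summary than accuracy, because a model that only predicts the largest class scores close to zero on it.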


### Predictions on the training data

In [27]:
# Set the parameters of the algorithm
# SelectFromModel with threshold=None and RandomForest with criterion=gini

pipeline1 = Pipeline(steps=[
                ('feature_selection',
                 SelectFromModel(estimator=LinearSVC(dual=False,
                                                     penalty='l1'))),
                ('classifier', RandomForestClassifier(n_estimators=20, random_state=0))])

pipeline1.fit(X_train[:50000], y_train[:50000])

train_labels1 = pipeline1.predict(X_train[:50000])
      
print(classification_report(y_train[:50000], train_labels1[:50000]))



              precision    recall  f1-score   support

           0       0.87      0.84      0.85       582
           1       0.80      0.70      0.75       477
           2       0.84      0.66      0.74       415
           3       0.90      0.85      0.87       283
           4       0.83      0.81      0.82      2977
           5       0.63      0.76      0.69      3637
           6       0.98      0.88      0.92       200
           7       0.78      0.84      0.81      1262
           8       0.81      0.75      0.78      1170
           9       0.76      0.80      0.78      6089
          10       0.81      0.81      0.81      3457
          11       0.71      0.57      0.64       328
          12       0.83      0.77      0.80      1618
          13       0.64      0.68      0.66      1829
          14       0.77      0.73      0.75      2073
          15       0.66      0.60      0.63       358
          16       0.78      0.71      0.74       139
          17       0.76    
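
The scores in this report are computed on the same 50,000 rows the pipeline was fitted on, so they largely reflect how well the random forest memorized those rows. As a rough illustration, the sketch below fits a forest on pure noise, where the labels are independent of the features; the data and variable names are assumptions for this example only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Pure noise: the labels carry no information about the features
rng = np.random.default_rng(0)
X_noise = rng.normal(size=(600, 8))
y_noise = rng.integers(0, 5, size=600)

X_a, X_b, y_a, y_b = train_test_split(
    X_noise, y_noise, test_size=0.25, random_state=0)

forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_a, y_a)

train_acc = forest.score(X_a, y_a)    # accuracy on the rows used for fitting
heldout_acc = forest.score(X_b, y_b)  # accuracy on rows the forest has not seen
print(f"train={train_acc:.2f} held-out={heldout_acc:.2f}")
```

A large gap between the two numbers is a sign of overfitting, which is why performance on held-out or cross-validated data is the more meaningful figure.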

In [29]:
# Set the parameters of the algorithm
# SelectFromModel with threshold='mean' and RandomForest with criterion=gini

pipeline2 = Pipeline(steps=[
                ('feature_selection',
                 SelectFromModel(estimator=LinearSVC(dual=False,
                                                     penalty='l1'), threshold='mean')),
                ('classifier', RandomForestClassifier(n_estimators=20, random_state=0))])

pipeline2.fit(X_train[:50000], y_train[:50000])

train_labels2 = pipeline2.predict(X_train[:50000])
      
print(classification_report(y_train[:50000], train_labels2[:50000]))



              precision    recall  f1-score   support

           0       0.29      0.02      0.04       582
           1       0.00      0.00      0.00       477
           2       0.00      0.00      0.00       415
           3       0.00      0.00      0.00       283
           4       0.00      0.00      0.00      2977
           5       0.25      0.00      0.00      3637
           6       0.00      0.00      0.00       200
           7       0.22      0.00      0.00      1262
           8       0.00      0.00      0.00      1170
           9       0.13      0.96      0.23      6089
          10       0.38      0.00      0.01      3457
          11       0.00      0.00      0.00       328
          12       0.00      0.00      0.00      1618
          13       0.00      0.00      0.00      1829
          14       0.00      0.00      0.00      2073
          15       0.00      0.00      0.00       358
          16       0.00      0.00      0.00       139
          17       0.00    



### Predictions on the development data (`X_dev`)

In [31]:
# Set the parameters of the algorithm
# SelectFromModel with threshold=None and RandomForest with criterion=gini

pipeline6 = Pipeline(steps=[
                ('feature_selection',
                 SelectFromModel(estimator=LinearSVC(dual=False,
                                                     penalty='l1'))),
                ('classifier', RandomForestClassifier(n_estimators=20, random_state=0))])

pipeline6.fit(X_train[:50000], y_train[:50000])

# Note: this accuracy is computed on X_test1, i.e. the development and test portions combined
print("Default score of the classifier: Accuracy=",pipeline6.score(X_test1, y_test1), "\n")

# Predictions on the development data
train_labels6 = pipeline6.predict(X_dev)
      
print(classification_report(y_dev, train_labels6))



Default score of the classifier: Accuracy= 0.14679698877971734 

              precision    recall  f1-score   support

           0       0.11      0.10      0.10      1141
           1       0.03      0.02      0.03       913
           2       0.03      0.02      0.02       846
           3       0.01      0.01      0.01       519
           4       0.16      0.16      0.16      5785
           5       0.15      0.19      0.17      6853
           6       0.05      0.04      0.04       395
           7       0.12      0.12      0.12      2441
           8       0.05      0.04      0.04      2291
           9       0.19      0.22      0.21     11804
          10       0.13      0.14      0.14      6772
          11       0.02      0.02      0.02       618
          12       0.11      0.10      0.10      3094
          13       0.14      0.14      0.14      3542
          14       0.15      0.14      0.15      4045
          15       0.03      0.03      0.03       677
          16  

In [30]:
# Set the parameters of the algorithm
# SelectFromModel with threshold='mean' and RandomForest with criterion=gini

pipeline5 = Pipeline(steps=[
                ('feature_selection',
                 SelectFromModel(estimator=LinearSVC(dual=False,
                                                     penalty='l1'), threshold='mean')),
                ('classifier', RandomForestClassifier(n_estimators=20, random_state=0))])

pipeline5.fit(X_train[:50000], y_train[:50000])

# Note: this accuracy is computed on X_test1, i.e. the development and test portions combined
print("Default score of the classifier: Accuracy=",pipeline5.score(X_test1, y_test1), "\n")

# Predictions on the development data
train_labels5 = pipeline5.predict(X_dev)
      
print(classification_report(y_dev, train_labels5))



Default score of the classifier: Accuracy= 0.1415509425447683 

              precision    recall  f1-score   support

           0       0.14      0.03      0.04      1141
           1       0.00      0.00      0.00       913
           2       0.00      0.00      0.00       846
           3       0.00      0.00      0.00       519
           4       0.00      0.00      0.00      5785
           5       0.00      0.00      0.00      6853
           6       0.00      0.00      0.00       395
           7       0.00      0.00      0.00      2441
           8       0.00      0.00      0.00      2291
           9       0.13      0.96      0.23     11804
          10       0.15      0.00      0.01      6772
          11       0.00      0.00      0.00       618
          12       0.00      0.00      0.00      3094
          13       0.00      0.00      0.00      3542
          14       0.00      0.00      0.00      4045
          15       0.00      0.00      0.00       677
          16   



### Evaluation: training vs. test data
- In general, accuracy on the training data is higher than on the development and test data.

----
Despite randomly halving the dataset, removing rows with NaN values and restricting the target classes, no reliable conclusions can be drawn from the large number of records. Accuracy is low for all predictions on unseen data: about 0.147 with `threshold=None` and about 0.142 with `threshold='mean'`, both measured on the combined development and test portion `X_test1`. The main reason is that precision and F-score are `0.0` for many target classes, so no meaningful training or test samples could be assigned to those classes.
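
In the development report for `threshold='mean'`, class 9 reaches a recall of 0.96 while almost every other class stays at 0.00, so that pipeline behaves much like a predictor that always outputs the largest class. A majority-class baseline makes this comparison explicit. The sketch below uses made-up label counts; only the idea of comparing against `DummyClassifier` is meant to carry over.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced labels in which class 9 dominates
y_train_toy = np.array([9] * 60 + [4] * 25 + [5] * 15)
y_dev_toy = np.array([9] * 12 + [4] * 5 + [5] * 3)

# The baseline ignores the features, so placeholder columns are sufficient
X_train_toy = np.zeros((len(y_train_toy), 1))
X_dev_toy = np.zeros((len(y_dev_toy), 1))

# Always predict the most frequent class seen during fitting
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_toy, y_train_toy)

baseline_acc = accuracy_score(y_dev_toy, baseline.predict(X_dev_toy))
print(baseline_acc)  # share of the majority class in the development labels
```

A classifier is only useful in practice if it clearly beats this baseline on the development or test data.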