# 6. Classification
## Project Implementation Requirements: Classification

---
**TASK:**

- Carry out a classification task on your data using an algorithm of your choice.
- First split the data to avoid overfitting during training and during parameter selection. Explain the chosen strategy and the split proportions.
- Select suitable features and set the algorithm's parameters. Describe the procedure you chose for selecting features and parameters. Report the parameter space and the final parameters. State the performance on the training data (or on the development data, if used).
- Evaluate the classification on the unseen test data. Consider precision and recall as well as the F-score. Which measure matters more for your application? Assess your result. Is it likely to be satisfactory in practice?

In [1]:
# Imports used below

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.preprocessing import LabelEncoder
import xgboost as xgb
from xgboost import XGBClassifier
import pandas as pd

In [2]:
apps = pd.read_csv("Daten/Google-Playstore_Edit2.csv")

In [3]:
# Drop all categories that are irrelevant to us as a company and ruled out for development
excluded_categories = ['Action', 'Arcade', 'Beauty', 'Casino', 'Comics', 'Dating', 'Educational',
                       'Puzzle', 'Racing', 'Role Playing', 'Shopping', 'Trivia', 'Video Players & Editors']
less_apps = apps[apps['Category'].isin(excluded_categories)]

apps.drop(less_apps.index, axis=0, inplace=True)

In [4]:
cat = apps.groupby('Category')
cat['Category'].count()

Category
Adventure             23196
Art & Design          18538
Auto & Vehicles       18278
Board                 10588
Books & Reference    116726
Business             143761
Card                   8179
Casual                50797
Communication         48159
Education            241075
Entertainment        138268
Events                12839
Finance               65456
Food & Drink          73920
Health & Fitness      83501
House & Home          14369
Libraries & Demo       5196
Lifestyle            118324
Maps & Navigation     26722
Medical               32063
Music                  4202
Music & Audio        154898
News & Magazines      42804
Parenting              3810
Personalization       89210
Photography           35552
Productivity          79686
Simulation            23276
Social                44729
Sports                47478
Strategy               8525
Tools                143976
Travel & Local        67282
Weather                7245
Word                   8630
Name: Categ

In [5]:
# Randomly reduce the dataset to half its size and create a new DataFrame
half_apps = apps.sample(frac = 0.5)

In [6]:
# Drop all columns with unique values: they would require too much compute
# apps_v2 = apps.copy()
half_apps.drop(columns=['App Name', 'App Id', 'Developer Id', 'Developer Website','Minimum Android', 'Developer Email', 'Privacy Policy', 'Released', 'Scraped Time', 'Last Updated'], inplace=True)

In [7]:
# Convert to float values
half_apps['Free']             = half_apps['Free'].astype(float)
half_apps['Ad Supported']     = half_apps['Ad Supported'].astype(float)
half_apps['Editors Choice']   = half_apps['Editors Choice'].astype(float)
half_apps['In App Purchases'] = half_apps['In App Purchases'].astype(float)
half_apps['Maximum Installs'] = half_apps['Maximum Installs'].astype(float)

In [8]:
half_apps.dtypes

Category             object
Rating              float64
Rating Count        float64
Installs             object
Minimum Installs    float64
Maximum Installs    float64
Free                float64
Price               float64
Currency             object
Size                float64
Content Rating       object
Ad Supported        float64
In App Purchases    float64
Editors Choice      float64
Released Year       float64
dtype: object

In [9]:
# Drop rows with missing values, then show the columns of dtype object
half_apps = half_apps.dropna()
half_apps.select_dtypes(include=['object'])

Unnamed: 0,Category,Installs,Currency,Content Rating
1032252,Entertainment,"10,000+",USD,Everyone
947580,Music & Audio,"100,000+",USD,Everyone
1431606,Sports,100+,USD,Everyone
838751,Casual,"1,000,000+",USD,Everyone
1068578,Lifestyle,"1,000+",USD,Everyone
...,...,...,...,...
905680,Tools,"10,000+",USD,Everyone
1404030,Music & Audio,50+,USD,Everyone
595624,Education,"50,000+",USD,Everyone
86078,Casual,10+,USD,Everyone


In [10]:
# Split the column names into a list of numerical and a list of still-categorical columns
numerical_cols = list(half_apps.select_dtypes(include="float").columns)
categorical_cols = list(half_apps.select_dtypes(include="object").columns)

In [11]:
# Remove Category, because it will be used as the target class
categorical_cols.remove("Category")
categorical_cols

['Installs', 'Currency', 'Content Rating']

In [12]:
# Because the classifier only works with numerical data, one-hot encoding
# is used to generate numerical columns from the categorical ones
X_dumm = pd.get_dummies(half_apps[categorical_cols])
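The `Installs` column holds ordered bucket labels such as `"10,000+"`, and the numeric `Minimum Installs` column appears to carry the same information. As an alternative to one-hot encoding it, the label could be parsed into a number. The following is a minimal sketch on a made-up frame; the helper name `installs_to_number` and the toy values are assumptions for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy stand-in for half_apps[['Installs']]; the values are made up for illustration.
toy = pd.DataFrame({"Installs": ["10,000+", "100+", "1,000,000+", "0+"]})


def installs_to_number(label: str) -> float:
    """Turn a bucket label such as '10,000+' into the number 10000.0."""
    return float(label.replace(",", "").rstrip("+"))


# One ordered numeric column instead of one dummy column per bucket.
toy["Installs_num"] = toy["Installs"].map(installs_to_number)
print(toy)
```

A single numeric column keeps the ordering of the buckets, which a set of independent dummy columns does not express.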

In [13]:
X_dumm.head(2)

Unnamed: 0,Installs_0+,Installs_1+,"Installs_1,000+","Installs_1,000,000+","Installs_1,000,000,000+",Installs_10+,"Installs_10,000+","Installs_10,000,000+","Installs_10,000,000,000+",Installs_100+,...,Currency_SGD,Currency_USD,Currency_VND,Currency_XXX,Content Rating_Adults only 18+,Content Rating_Everyone,Content Rating_Everyone 10+,Content Rating_Mature 17+,Content Rating_Teen,Content Rating_Unrated
1032252,0,0,0,0,0,0,1,0,0,0,...,0,1,0,0,0,1,0,0,0,0
947580,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,1,0,0,0,0


In [14]:
# Concatenate the numerical columns and the one-hot encoded columns
X = pd.concat([half_apps[numerical_cols], X_dumm], axis = 1)

In [15]:
X.head()

Unnamed: 0,Rating,Rating Count,Minimum Installs,Maximum Installs,Free,Price,Size,Ad Supported,In App Purchases,Editors Choice,...,Currency_SGD,Currency_USD,Currency_VND,Currency_XXX,Content Rating_Adults only 18+,Content Rating_Everyone,Content Rating_Everyone 10+,Content Rating_Mature 17+,Content Rating_Teen,Content Rating_Unrated
1032252,3.5,56.0,10000.0,11426.0,1.0,0.0,0.0,1.0,0.0,0.0,...,0,1,0,0,0,1,0,0,0,0
947580,3.7,402.0,100000.0,130419.0,1.0,0.0,14.0,1.0,0.0,0.0,...,0,1,0,0,0,1,0,0,0,0
1431606,0.0,0.0,100.0,165.0,1.0,0.0,11.0,0.0,0.0,0.0,...,0,1,0,0,0,1,0,0,0,0
838751,3.9,13013.0,1000000.0,4066159.0,1.0,0.0,7.5,1.0,0.0,0.0,...,0,1,0,0,0,1,0,0,0,0
1068578,3.3,12.0,1000.0,2190.0,1.0,0.0,4.0,1.0,0.0,0.0,...,0,1,0,0,0,1,0,0,0,0


In [16]:
y = half_apps['Category'] #.copy()
print(f"X and y have the same number of rows: {X.shape[0] == y.shape[0]}")

X and y have the same number of rows: True


In [17]:
label_encoder = LabelEncoder()

In [18]:
y = label_encoder.fit_transform(y) # encodes the class names as integers 0, 1, 2, 3, ...
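The classification reports further down list the classes by integer code. The sketch below, on made-up labels, shows how a fitted `LabelEncoder` maps the codes back to names; the variable names `toy_categories`, `toy_encoder` and `toy_y` are illustrative only.

```python
from sklearn.preprocessing import LabelEncoder

# Toy category labels; the notebook itself encodes half_apps['Category'].
toy_categories = ["Tools", "Education", "Tools", "Music & Audio"]

toy_encoder = LabelEncoder()
toy_y = toy_encoder.fit_transform(toy_categories)

# classes_ holds the sorted class names; position i is the name behind code i.
print(list(toy_encoder.classes_))
print(list(toy_y))

# inverse_transform turns integer codes back into the original names.
print(list(toy_encoder.inverse_transform([0, 2])))
```

In the same way, `label_encoder.classes_` could be passed as `target_names` to `classification_report`, provided every class occurs in the evaluated data.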

In [19]:
# Split the data into a training set (80%) and a held-out set (20%)
X_train, X_test1, y_train, y_test1 = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Split the held-out set into development and final test data (50-50, stratified)
X_dev, X_test, y_dev, y_test = train_test_split(X_test1, y_test1, test_size=0.5, stratify=y_test1, random_state=42)
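The two-step split yields proportions of 80% training, 10% development and 10% test data, with the class distribution preserved in each part. A minimal sketch on made-up, balanced toy data (all names prefixed `toy_`, `tr_`, `dev_`, `te_` are illustrative) makes the resulting sizes visible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1000 rows, 4 balanced classes (made up for illustration).
toy_X = np.arange(1000).reshape(-1, 1)
toy_y = np.arange(1000) % 4

# Same two-step scheme: 80% training, then the remaining 20% split in half.
tr_X, rest_X, tr_y, rest_y = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y)
dev_X, te_X, dev_y, te_y = train_test_split(
    rest_X, rest_y, test_size=0.5, random_state=42, stratify=rest_y)

# Print the size and the per-class counts of each part.
for name, part in [("train", tr_y), ("dev", dev_y), ("test", te_y)]:
    print(name, len(part), np.bincount(part))
```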

- Run a grid search
- Also run manual predictions

In [22]:
# Set the parameters of the algorithm
# Redefine the pipeline, without fixing any parameters yet

feature_selection = SelectFromModel(LinearSVC(penalty="l1", dual=False))
classifier = XGBClassifier()

pipeline = Pipeline([('feature_selection', feature_selection), ('classifier', classifier)])

# Define the parameter space: each key is stepname__parametername, each value lists the settings to try

parameters = {
    'feature_selection__threshold': (None, 'mean'),
    'classifier__booster': ('gbtree', 'gblinear'),
}

# Search the whole parameter space (cross-validation on the training data)
grid_search = GridSearchCV(pipeline, param_grid=parameters, verbose=10)

# Fit on the first 50,000 training rows only
grid_search.fit(X_train[:50000], y_train[:50000])

Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV 1/5; 1/4] START classifier__booster=gbtree, feature_selection__threshold=None
[CV 1/5; 1/4] END classifier__booster=gbtree, feature_selection__threshold=None;, score=0.207 total time=13.3min
[CV 2/5; 1/4] START classifier__booster=gbtree, feature_selection__threshold=None
[CV 2/5; 1/4] END classifier__booster=gbtree, feature_selection__threshold=None;, score=0.197 total time=13.7min
[CV 3/5; 1/4] START classifier__booster=gbtree, feature_selection__threshold=None
[CV 3/5; 1/4] END classifier__booster=gbtree, feature_selection__threshold=None;, score=0.209 total time=13.3min
[CV 4/5; 1/4] START classifier__booster=gbtree, feature_selection__threshold=None
[CV 4/5; 1/4] END classifier__booster=gbtree, feature_selection__threshold=None;, score=0.205 total time=13.4min
[CV 5/5; 1/4] START classifier__booster=gbtree, feature_selection__threshold=None
[CV 5/5; 1/4] END classifier__booster=gbtree, feature_selection__threshold=None;, score=0.209 total time=11.9min
[CV 1/5; 2/4] START classifier__booster=gbtree, feature_selection__threshold=mean
[CV 1/5; 2/4] END classifier__booster=gbtree, feature_selection__threshold=mean;, score=0.149 total time= 7.9min
[CV 2/5; 2/4] START classifier__booster=gbtree, feature_selection__threshold=mean
[CV 2/5; 2/4] END classifier__booster=gbtree, feature_selection__threshold=mean;, score=0.147 total time= 8.6min
[CV 3/5; 2/4] START classifier__booster=gbtree, feature_selection__threshold=mean
[CV 3/5; 2/4] END classifier__booster=gbtree, feature_selection__threshold=mean;, score=0.147 total time= 8.9min
[CV 4/5; 2/4] START classifier__booster=gbtree, feature_selection__threshold=mean
[CV 4/5; 2/4] END classifier__booster=gbtree, feature_selection__threshold=mean;, score=0.151 total time= 8.4min
[CV 5/5; 2/4] START classifier__booster=gbtree, feature_selection__threshold=mean
[CV 5/5; 2/4] END classifier__booster=gbtree, feature_selection__threshold=mean;, score=0.147 total time= 8.5min
[CV 1/5; 3/4] START classifier__booster=gblinear, feature_selection__threshold=None
[CV 1/5; 3/4] END classifier__booster=gblinear, feature_selection__threshold=None;, score=0.161 total time= 2.5min
[CV 2/5; 3/4] START classifier__booster=gblinear, feature_selection__threshold=None
[CV 2/5; 3/4] END classifier__booster=gblinear, feature_selection__threshold=None;, score=0.164 total time= 2.4min
[CV 3/5; 3/4] START classifier__booster=gblinear, feature_selection__threshold=None
[CV 3/5; 3/4] END classifier__booster=gblinear, feature_selection__threshold=None;, score=0.166 total time= 2.5min
[CV 4/5; 3/4] START classifier__booster=gblinear, feature_selection__threshold=None
[CV 4/5; 3/4] END classifier__booster=gblinear, feature_selection__threshold=None;, score=0.168 total time= 2.6min
[CV 5/5; 3/4] START classifier__booster=gblinear, feature_selection__threshold=None
[CV 5/5; 3/4] END classifier__booster=gblinear, feature_selection__threshold=None;, score=0.165 total time= 2.5min
[CV 1/5; 4/4] START classifier__booster=gblinear, feature_selection__threshold=mean
[CV 1/5; 4/4] END classifier__booster=gblinear, feature_selection__threshold=mean;, score=0.142 total time= 2.0min
[CV 2/5; 4/4] START classifier__booster=gblinear, feature_selection__threshold=mean
[CV 2/5; 4/4] END classifier__booster=gblinear, feature_selection__threshold=mean;, score=0.142 total time= 1.9min
[CV 3/5; 4/4] START classifier__booster=gblinear, feature_selection__threshold=mean
[CV 3/5; 4/4] END classifier__booster=gblinear, feature_selection__threshold=mean;, score=0.147 total time= 1.9min
[CV 4/5; 4/4] START classifier__booster=gblinear, feature_selection__threshold=mean
[CV 4/5; 4/4] END classifier__booster=gblinear, feature_selection__threshold=mean;, score=0.143 total time= 1.8min
[CV 5/5; 4/4] START classifier__booster=gblinear, feature_selection__threshold=mean
[CV 5/5; 4/4] END classifier__booster=gblinear, feature_selection__threshold=mean;, score=0.141 total time= 1.9min

GridSearchCV(estimator=Pipeline(steps=[('feature_selection',
                                        SelectFromModel(estimator=LinearSVC(dual=False,
                                                                            penalty='l1'))),
                                       ('classifier',
                                        XGBClassifier(base_score=None,
                                                      booster=None,
                                                      colsample_bylevel=None,
                                                      colsample_bynode=None,
                                                      colsample_bytree=None,
                                                      enable_categorical=False,
                                                      gamma=None, gpu_id=None,
                                                      importance_type=None,
                                                      interaction_constraints=None,
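The grid varies `feature_selection__threshold` between `None` and `'mean'`. With an L1-penalized estimator, `SelectFromModel` treats `threshold=None` as a cutoff of `1e-5`, so only features whose coefficients were driven to (almost) zero are dropped, whereas `'mean'` keeps only features with above-average importance. The sketch below counts the surviving features for both settings on synthetic data; the dataset and the `kept` dictionary are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

# Synthetic data: 20 features, of which 5 are informative (illustrative only).
toy_X, toy_y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0)

kept = {}
for threshold in (None, "mean"):
    selector = SelectFromModel(
        LinearSVC(penalty="l1", dual=False, random_state=0),
        threshold=threshold,
    )
    selector.fit(toy_X, toy_y)
    # get_support() is a boolean mask over the input features.
    kept[str(threshold)] = int(selector.get_support().sum())

print(kept)
```

The stricter `'mean'` setting never keeps more features than `None` here, which matches the shorter fit times and lower scores of the `threshold=mean` candidates in the log.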
              

In [23]:
# Which parameter combination is the best?
print(grid_search.best_estimator_)

Pipeline(steps=[('feature_selection',
                 SelectFromModel(estimator=LinearSVC(dual=False,
                                                     penalty='l1'))),
                ('classifier',
                 XGBClassifier(base_score=0.5, booster='gbtree',
                               colsample_bylevel=1, colsample_bynode=1,
                               colsample_bytree=1, enable_categorical=False,
                               gamma=0, gpu_id=-1, importance_type=None,
                               interaction_constraints='',
                               learning_rate=0.300000012, max_delta_step=0,
                               max_depth=6, min_child_weight=1, missing=nan,
                               monotone_constraints='()', n_estimators=100,
                               n_jobs=8, num_parallel_tree=1,
                               objective='multi:softprob', predictor='auto',
                               random_state=0, reg_alpha=0, reg_lambda=1,
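`GridSearchCV` ranks candidates by accuracy unless a `scoring` argument is given. Since the task asks about precision, recall and the F-score, one option is to search with `scoring="f1_macro"` and to read the chosen settings from `best_params_`. The following is a sketch on synthetic data; `DecisionTreeClassifier` is a lightweight stand-in for the notebook's `XGBClassifier`, and the grid over `max_depth` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data (illustrative only).
toy_X, toy_y = make_classification(
    n_samples=400, n_features=12, n_informative=6, n_classes=3, random_state=0)

toy_pipeline = Pipeline([
    ("feature_selection",
     SelectFromModel(LinearSVC(penalty="l1", dual=False, random_state=0))),
    # Stand-in classifier, NOT the notebook's XGBClassifier.
    ("classifier", DecisionTreeClassifier(random_state=0)),
])

# Hypothetical parameter space: 2 x 2 = 4 candidates.
toy_grid = {
    "feature_selection__threshold": [None, "mean"],
    "classifier__max_depth": [3, 6],
}

# Rank candidates by macro-averaged F1 instead of the default accuracy.
toy_search = GridSearchCV(toy_pipeline, param_grid=toy_grid, scoring="f1_macro", cv=3)
toy_search.fit(toy_X, toy_y)

print(toy_search.best_params_)
print(round(toy_search.best_score_, 3))
```

`best_params_` contains only the searched settings, which is usually easier to report than the full estimator repr.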
        

In [25]:
# Define the pipeline for the best parameter combination
# Parameters are copied from the .best_estimator_ output above
final_pipeline = Pipeline(steps=[('feature_selection',
                 SelectFromModel(estimator=LinearSVC(dual=False,
                                                     penalty='l1'))),
                ('classifier',
                 XGBClassifier(base_score=0.5, booster='gbtree',
                               colsample_bylevel=1, colsample_bynode=1,
                               colsample_bytree=1, enable_categorical=False,
                               gamma=0, gpu_id=-1, importance_type=None,
                               interaction_constraints='',
                               learning_rate=0.300000012, max_delta_step=0,
                               max_depth=6, min_child_weight=1, 
                               monotone_constraints='()', n_estimators=100,
                               n_jobs=8, num_parallel_tree=1,
                               objective='multi:softprob', predictor='auto',
                               random_state=0, reg_alpha=0, reg_lambda=1,
                               scale_pos_weight=None, subsample=1,
                               tree_method='exact', validate_parameters=1,
                               verbosity=None))])

train_labels = cross_val_predict(final_pipeline, X_train[:50000] , y_train[:50000], cv=10)

# Compute precision/recall/F-score

print(classification_report(y_train[:50000], train_labels[:50000]))

              precision    recall  f1-score   support

           0       0.19      0.10      0.13       572
           1       0.07      0.00      0.01       467
           2       0.06      0.00      0.01       405
           3       0.03      0.01      0.01       287
           4       0.21      0.23      0.22      2955
           5       0.17      0.41      0.24      3546
           6       0.07      0.02      0.04       210
           7       0.17      0.18      0.18      1233
           8       0.15      0.03      0.05      1140
           9       0.19      0.38      0.25      6060
          10       0.17      0.18      0.17      3444
          11       0.02      0.00      0.01       344
          12       0.21      0.10      0.14      1592
          13       0.24      0.16      0.19      1826
          14       0.32      0.11      0.16      2049
          15       0.00      0.00      0.00       353
          16       0.00      0.00      0.00       127
          17       0.23    
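Copying the parameters of the best estimator by hand is error-prone. `sklearn.base.clone` returns an unfitted copy with identical hyperparameters, which can then be handed to `cross_val_predict`. A minimal sketch with a stand-in pipeline follows; the variable `fitted` plays the role of `grid_search.best_estimator_`, and the scaler and tree steps are assumptions for illustration.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (illustrative only).
toy_X, toy_y = make_classification(n_samples=300, n_features=8, random_state=0)

# Stand-in for grid_search.best_estimator_: an already fitted pipeline.
fitted = Pipeline([
    ("scale", StandardScaler()),
    ("classifier", DecisionTreeClassifier(max_depth=4, random_state=0)),
]).fit(toy_X, toy_y)

# clone() copies the hyperparameters but none of the fitted state.
fresh = clone(fitted)

# cross_val_predict fits its own copy per fold, so every row is predicted
# by a model that did not see it during training.
cv_labels = cross_val_predict(fresh, toy_X, toy_y, cv=5)
print(cv_labels[:10])
```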

### Predictions on the training data

In [27]:
# Set the parameters of the algorithm
pipeline1 = Pipeline(steps=[('feature_selection',
                 SelectFromModel(estimator=LinearSVC(dual=False,
                                                     penalty='l1'))),
                ('classifier',
                 XGBClassifier(base_score=0.5, booster='gbtree',
                               colsample_bylevel=1, colsample_bynode=1,
                               colsample_bytree=1, enable_categorical=False,
                               gamma=0, gpu_id=-1, importance_type=None,
                               interaction_constraints='',
                               learning_rate=0.300000012, max_delta_step=0,
                               max_depth=6, min_child_weight=1, 
                               monotone_constraints='()', n_estimators=100,
                               n_jobs=8, num_parallel_tree=1,
                               objective='multi:softprob', predictor='auto',
                               random_state=0, reg_alpha=0, reg_lambda=1,
                               scale_pos_weight=None, subsample=1,
                               tree_method='exact', validate_parameters=1,
                               verbosity=None))])

pipeline1.fit(X_train[:50000], y_train[:50000])

train_labels1 = pipeline1.predict(X_train[:50000])
      
print(classification_report(y_train[:50000], train_labels1[:50000]))



              precision    recall  f1-score   support

           0       0.74      0.40      0.52       572
           1       0.75      0.08      0.14       467
           2       0.84      0.12      0.21       405
           3       0.80      0.26      0.39       287
           4       0.36      0.37      0.36      2955
           5       0.23      0.57      0.33      3546
           6       0.83      0.42      0.56       210
           7       0.39      0.43      0.41      1233
           8       0.67      0.16      0.26      1140
           9       0.26      0.52      0.35      6060
          10       0.31      0.32      0.31      3444
          11       0.67      0.12      0.20       344
          12       0.47      0.24      0.32      1592
          13       0.37      0.25      0.30      1826
          14       0.58      0.21      0.30      2049
          15       0.73      0.10      0.18       353
          16       0.90      0.20      0.33       127
          17       0.41    
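This report is computed on the same rows the pipeline was fitted on, so it is optimistic compared with the cross-validated report (for class 0, precision 0.74 here versus 0.19 under cross-validation). The gap between the two kinds of score can be reproduced with a small sketch on synthetic data; the unpruned decision tree is an assumption chosen because it memorizes its training rows.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (flip_y randomly reassigns 20% of the labels).
toy_X, toy_y = make_classification(
    n_samples=500, n_features=20, n_informative=4, flip_y=0.2, random_state=0)

toy_model = DecisionTreeClassifier(random_state=0)  # unpruned tree

# Score on the rows used for fitting.
resub_acc = toy_model.fit(toy_X, toy_y).score(toy_X, toy_y)

# Score on rows held out from fitting, averaged over 5 folds.
cv_acc = cross_val_score(toy_model, toy_X, toy_y, cv=5).mean()

print(f"accuracy on fitted rows: {resub_acc:.3f}")
print(f"cross-validated accuracy: {cv_acc:.3f}")
```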

### Predictions on the development data (X_dev)

In [29]:
# Set the parameters of the algorithm
pipeline5 = Pipeline(steps=[('feature_selection',
                 SelectFromModel(estimator=LinearSVC(dual=False,
                                                     penalty='l1'))),
                ('classifier',
                 XGBClassifier(base_score=0.5, booster='gbtree',
                               colsample_bylevel=1, colsample_bynode=1,
                               colsample_bytree=1, enable_categorical=False,
                               gamma=0, gpu_id=-1, importance_type=None,
                               interaction_constraints='',
                               learning_rate=0.300000012, max_delta_step=0,
                               max_depth=6, min_child_weight=1, 
                               monotone_constraints='()', n_estimators=100,
                               n_jobs=8, num_parallel_tree=1,
                               objective='multi:softprob', predictor='auto',
                               random_state=0, reg_alpha=0, reg_lambda=1,
                               scale_pos_weight=None, subsample=1,
                               tree_method='exact', validate_parameters=1,
                               verbosity=None))])
pipeline5.fit(X_train[:50000], y_train[:50000])

# Note: X_test1/y_test1 is the combined 20% hold-out (development + test), not the development split alone
print("Default score of the classifier: accuracy=", pipeline5.score(X_test1, y_test1), "\n")

train_labels5 = pipeline5.predict(X_dev)

print(classification_report(y_dev, train_labels5))


Default score of the classifier: accuracy= 0.20610311307985726 

              precision    recall  f1-score   support

           0       0.19      0.09      0.12      1123
           1       0.04      0.00      0.00       907
           2       0.05      0.00      0.01       847
           3       0.02      0.00      0.01       523
           4       0.21      0.22      0.22      5766
           5       0.17      0.41      0.24      6837
           6       0.09      0.03      0.05       397
           7       0.19      0.20      0.19      2438
           8       0.18      0.03      0.05      2313
           9       0.19      0.38      0.25     11781
          10       0.17      0.18      0.18      6759
          11       0.05      0.01      0.01       623
          12       0.21      0.11      0.14      3105
          13       0.25      0.17      0.20      3566
          14       0.34      0.13      0.19      4025
          15       0.17      0.01      0.02       681
          16  
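The task also calls for an evaluation on the unseen test data (`X_test`, `y_test` in this notebook). A sketch of that final step on synthetic, imbalanced data is shown below; it reports macro and weighted averages of precision, recall and F-score. The dataset, the stand-in `DecisionTreeClassifier` and the `summary` dictionary are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced 3-class data (illustrative only).
toy_X, toy_y = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3,
    weights=[0.7, 0.2, 0.1], random_state=0)
tr_X, te_X, tr_y, te_y = train_test_split(
    toy_X, toy_y, test_size=0.1, stratify=toy_y, random_state=42)

# Stand-in model, fitted on the training part only.
toy_model = DecisionTreeClassifier(max_depth=4, random_state=0).fit(tr_X, tr_y)
te_pred = toy_model.predict(te_X)

# "macro" weights every class equally; "weighted" weights by class size.
summary = {}
for average in ("macro", "weighted"):
    p, r, f, _ = precision_recall_fscore_support(
        te_y, te_pred, average=average, zero_division=0)
    summary[average] = (p, r, f)
    print(f"{average:>8}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

With many small classes, the macro average exposes weak per-class performance that a weighted average or plain accuracy can hide.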

### Evaluation: training vs. test data
- In general, accuracy on the training data is higher than on the development and test data.
----
Despite randomly halving the dataset, removing rows with NaN values, and narrowing down the target classes, no reliable conclusions can be drawn from the large number of records. Accuracy is low for all predictions (about 0.21 on the held-out data, in line with the cross-validation scores of roughly 0.20 to 0.21), because for many target classes precision and F-score are at or near `0.0`: no informative training or test examples could be matched to those classes.
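An accuracy of about 0.21 is hard to judge on its own when the 35 categories differ so much in size. A majority-class baseline gives a reference point: a model is only useful if it clearly beats always predicting the largest class. The sketch below uses made-up class counts; in the notebook the baseline would be fitted on `X_train`, `y_train` and scored on the development or test split.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up imbalanced labels: 60% class 0, 30% class 1, 10% class 2.
toy_y = np.array([0] * 60 + [1] * 30 + [2] * 10)
toy_X = np.zeros((len(toy_y), 1))  # the baseline ignores the features

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(toy_X, toy_y)
baseline_acc = baseline.score(toy_X, toy_y)

print(f"majority-class baseline accuracy: {baseline_acc:.2f}")
```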