### Example scalability problems in ML tasks:

CPU-bound: The data fits in RAM, but training takes too long, e.g. when many combinations of model parameters or many different models have to be evaluated.


Memory-bound: The data is too large to fit in RAM.




In [None]:
from IPython.display import Image
from IPython.core.display import HTML 
Image(url= "https://spark.apache.org/docs/latest/img/ml-Pipeline.png")

In [None]:
Image(url= "https://spark.apache.org/docs/latest/img/ml-PipelineModel.png")

### ML processing pipeline

* <b>DataFrame</b>: This ML API uses the DataFrame from Spark SQL as the ML dataset, which can hold a variety of data types. For example, a DataFrame can have different columns storing text, feature vectors, true labels, and predictions.


* <b>Transformer</b>: A Transformer is an algorithm that can turn one DataFrame into another DataFrame. For example, an ML model is a Transformer that turns a DataFrame with features into a DataFrame with predictions.


* <b>Estimator</b>: An Estimator is an algorithm that can be fit on a DataFrame to produce a Transformer. For example, a learning algorithm is an Estimator that trains on a DataFrame and produces a model.


* <b>Pipeline</b>: A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.


* <b>Parameter</b>: All Transformers and Estimators now share a common API for specifying parameters.
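
The relationship between these concepts can be sketched in plain Python, without Spark. The classes below (`MeanCenterer`, `MeanCentererModel`, `ToyPipeline`, `ToyPipelineModel`) are hypothetical names invented for this illustration and operate on lists of dicts instead of DataFrames; they only mirror the `fit()` / `transform()` contract described above and are not how Spark implements it.

```python
# Toy illustration of the Estimator / Transformer / Pipeline contract.
# All class names are hypothetical; this is not the Spark implementation.

class MeanCenterer:
    """Estimator: learns the column mean and returns a fitted Transformer."""
    def __init__(self, column):
        self.column = column

    def fit(self, rows):
        values = [row[self.column] for row in rows]
        return MeanCentererModel(self.column, sum(values) / len(values))


class MeanCentererModel:
    """Transformer: maps rows to new rows with one extra column."""
    def __init__(self, column, mean):
        self.column = column
        self.mean = mean

    def transform(self, rows):
        new_col = self.column + "_centered"
        return [{**row, new_col: row[self.column] - self.mean} for row in rows]


class ToyPipeline:
    """Chains stages; fitting turns every Estimator into its Transformer."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Estimators have fit(); plain Transformers are used as they are.
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return ToyPipelineModel(fitted)


class ToyPipelineModel:
    """Fitted pipeline: applies every fitted stage in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


data = [{"x": 1.0}, {"x": 2.0}, {"x": 3.0}]
model = ToyPipeline([MeanCenterer("x")]).fit(data)
result = model.transform(data)
print(result)
```

The point of the pattern is that a fitted pipeline is itself a Transformer, so the same sequence of steps learned on the training data can be replayed unchanged on new data.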

In [2]:
import os
user_name = os.environ.get('USER')
print(user_name)

import random
ui_port = random.randint(4000,4999)
print(ui_port)

from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.master('yarn-client') \
.config('spark.driver.memory','1g') \
.config('spark.executor.memory', '2g') \
.config('spark.ui.port',f'{ui_port}') \
.appName(f'ds-{user_name}') \
.getOrCreate()

iagasz
4780


In [3]:
path = f'/user/{user_name}/survey/data/survey_results_public.csv'

In [4]:
db_name = user_name.replace('-','_')
table_name = "survey"

In [5]:
spark.sql(f'DROP DATABASE IF EXISTS {db_name} CASCADE')
spark.sql(f'CREATE DATABASE {db_name} LOCATION "/edugen/db/{db_name}"')
spark.sql(f'USE {db_name}')

DataFrame[]

In [6]:
spark.sql(f'DROP TABLE IF EXISTS {table_name}')

spark.sql(f'CREATE TABLE IF NOT EXISTS {table_name} \
          USING csv \
          OPTIONS (HEADER true, INFERSCHEMA true, NULLVALUE "NA") \
          LOCATION "{path}"')

DataFrame[]

In [11]:
spark.sql(f'describe {table_name}').show(100)

+--------------------+-------------+-------+
|            col_name|    data_type|comment|
+--------------------+-------------+-------+
|          Respondent|          int|   null|
|          MainBranch|       string|   null|
|            Hobbyist|       string|   null|
|         OpenSourcer|       string|   null|
|          OpenSource|       string|   null|
|          Employment|       string|   null|
|             Country|       string|   null|
|             Student|       string|   null|
|             EdLevel|       string|   null|
|      UndergradMajor|       string|   null|
|            EduOther|       string|   null|
|             OrgSize|       string|   null|
|             DevType|       string|   null|
|           YearsCode|       string|   null|
|          Age1stCode|       string|   null|
|        YearsCodePro|       string|   null|
|           CareerSat|       string|   null|
|              JobSat|       string|   null|
|            MgrIdiot|       string|   null|
|         

### Preparing the data for analysis

* The goal of this exercise is to build a classifier that predicts whether a respondent earns more than 60,000 USD per year.

In [12]:
spark_df= spark.sql(f'select *,cast((convertedComp > 60000) as string) as compAboveAvg \
                    from {table_name} where convertedComp is not null ')
spark_df.limit(5).toPandas()

Unnamed: 0,Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,...,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase,compAboveAvg
0,3,"I am not primarily a developer, but I write co...",Yes,Never,The quality of OSS and closed source software ...,Employed full-time,Thailand,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Web development or web design,...,Tech meetups or events in your area;Courses on...,28.0,Man,No,Straight / Heterosexual,,Yes,Appropriate in length,Neither easy nor difficult,False
1,4,I am a developer by profession,No,Never,The quality of OSS and closed source software ...,Employed full-time,United States,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",...,Tech articles written by other developers;Indu...,22.0,Man,No,Straight / Heterosexual,White or of European descent,No,Appropriate in length,Easy,True
2,6,"I am not primarily a developer, but I write co...",Yes,Never,The quality of OSS and closed source software ...,Employed full-time,Canada,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Mathematics or statistics,...,Tech articles written by other developers;Indu...,28.0,Man,No,Straight / Heterosexual,East Asian,No,Too long,Neither easy nor difficult,True
3,9,I am a developer by profession,Yes,Once a month or more often,The quality of OSS and closed source software ...,Employed full-time,New Zealand,No,Some college/university study without earning ...,"Computer science, computer engineering, or sof...",...,,23.0,Man,No,Bisexual,White or of European descent,No,Appropriate in length,Neither easy nor difficult,True
4,10,I am a developer by profession,Yes,Once a month or more often,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,India,No,"Master’s degree (MA, MS, M.Eng., MBA, etc.)",,...,Tech articles written by other developers;Tech...,,,,,,Yes,Too long,Difficult,False


The aim is to end up with a single feature vector and a single label column.

First step, feature extraction: encode the text columns as numeric indices, then encode those indices as one-hot vectors (OneHotEncoder). Finally, assemble everything into a single vector.
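
The three steps can be sketched in plain Python on a few made-up rows. The helper names `fit_string_index` and `one_hot` are hypothetical, and the sketch makes two simplifying assumptions: ties in frequency are broken alphabetically, and every category gets its own position (Spark's `OneHotEncoder` drops the last category by default, so its vectors are one element shorter).

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct string to an index, most frequent value first."""
    counts = Counter(values)
    # Descending frequency; alphabetical tie-break is an assumption of this sketch.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: idx for idx, value in enumerate(ordered)}

def one_hot(index, size):
    """Return a dense list with 1.0 at `index` and 0.0 elsewhere."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

# Made-up sample rows, not taken from the survey data.
rows = [
    {"OpSys": "Windows", "Student": "No"},
    {"OpSys": "Linux-based", "Student": "No"},
    {"OpSys": "Windows", "Student": "Yes, full-time"},
]
columns = ["OpSys", "Student"]

# Step 1: learn one string-to-index mapping per column.
indexers = {c: fit_string_index([r[c] for r in rows]) for c in columns}

# Steps 2 and 3: one-hot encode each index and concatenate into one vector.
features = []
for r in rows:
    vec = []
    for c in columns:
        mapping = indexers[c]
        vec += one_hot(mapping[r[c]], len(mapping))
    features.append(vec)

print(indexers)
print(features)
```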

In [13]:
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml import Pipeline
# we want to predict compAboveAvg
y = 'compAboveAvg'
# based on:
feature_columns = ['OpSys', 'EdLevel', 'MainBranch' , 'Country', 'Student', 'YearsCode']

In [14]:
# We start with the StringIndexer transformer, which maps 'string' values to numbers

##### first a plain for loop, then we refactor it into a list comprehension

# for the features that will be used for prediction

stringindexer_stages_1 = []
for c in feature_columns:
    stringindexer_stages_1.append(StringIndexer(inputCol=c, outputCol='strindexed_' + c).setHandleInvalid("keep"))


# and for the target variable
stringindexer_stages_1.append(StringIndexer(inputCol=y, outputCol='label').setHandleInvalid("keep"))


In [15]:
# Refactored into a list comprehension

stringindexer_stages = [StringIndexer(inputCol=c, outputCol='strindexed_' + c).setHandleInvalid("keep") for c in feature_columns]

# and for the target variable
stringindexer_stages += [StringIndexer(inputCol=y, outputCol='label').setHandleInvalid("keep")]
stringindexer_stages

[StringIndexer_24ac71642d2f,
 StringIndexer_911735fc81ec,
 StringIndexer_1061c93662f7,
 StringIndexer_44fe42fa5249,
 StringIndexer_cb2b0832fcf9,
 StringIndexer_63969296a928,
 StringIndexer_71675096026f]

In [16]:
# This transformation adds 7 new columns to the DF: six with the "strindexed_" prefix plus "label"
Pipeline(stages=stringindexer_stages).fit(spark_df).transform(spark_df).toPandas()

Unnamed: 0,Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,...,SurveyLength,SurveyEase,compAboveAvg,strindexed_OpSys,strindexed_EdLevel,strindexed_MainBranch,strindexed_Country,strindexed_Student,strindexed_YearsCode,label
0,3,"I am not primarily a developer, but I write co...",Yes,Never,The quality of OSS and closed source software ...,Employed full-time,Thailand,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Web development or web design,...,Appropriate in length,Neither easy nor difficult,false,2.0,0.0,1.0,57.0,0.0,10.0,0.0
1,4,I am a developer by profession,No,Never,The quality of OSS and closed source software ...,Employed full-time,United States,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",...,Appropriate in length,Easy,true,0.0,0.0,0.0,0.0,0.0,10.0,1.0
2,6,"I am not primarily a developer, but I write co...",Yes,Never,The quality of OSS and closed source software ...,Employed full-time,Canada,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Mathematics or statistics,...,Too long,Neither easy nor difficult,true,0.0,0.0,1.0,4.0,0.0,13.0,1.0
3,9,I am a developer by profession,Yes,Once a month or more often,The quality of OSS and closed source software ...,Employed full-time,New Zealand,No,Some college/university study without earning ...,"Computer science, computer engineering, or sof...",...,Appropriate in length,Neither easy nor difficult,true,1.0,2.0,0.0,31.0,0.0,8.0,1.0
4,10,I am a developer by profession,Yes,Once a month or more often,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,India,No,"Master’s degree (MA, MS, M.Eng., MBA, etc.)",,...,Too long,Difficult,false,0.0,1.0,0.0,2.0,0.0,8.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
55818,88878,I am a developer by profession,Yes,Less than once per year,The quality of OSS and closed source software ...,Employed full-time,United States,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",...,Appropriate in length,Easy,true,1.0,0.0,0.0,0.0,0.0,8.0,1.0
55819,88879,I am a developer by profession,Yes,Never,The quality of OSS and closed source software ...,Employed full-time,Finland,No,"Master’s degree (MA, MS, M.Eng., MBA, etc.)","Computer science, computer engineering, or sof...",...,Appropriate in length,Easy,true,0.0,1.0,0.0,30.0,0.0,18.0,1.0
55820,88881,I am a developer by profession,Yes,Once a month or more often,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,Austria,No,"Master’s degree (MA, MS, M.Eng., MBA, etc.)","Computer science, computer engineering, or sof...",...,Appropriate in length,Easy,true,1.0,1.0,0.0,17.0,0.0,14.0,1.0
55821,88882,I am a developer by profession,Yes,Never,"OSS is, on average, of LOWER quality than prop...",Employed full-time,Netherlands,"Yes, full-time","Master’s degree (MA, MS, M.Eng., MBA, etc.)","Computer science, computer engineering, or sof...",...,Too long,Easy,true,2.0,1.0,0.0,10.0,1.0,0.0,1.0


In [17]:
onehotencoder_stages = [OneHotEncoder(inputCol='strindexed_' + c, outputCol='onehot_' + c) for c in feature_columns]

In [18]:
# Extending the pipeline...
# This transformation adds 6 new columns with the "onehot_" prefix to the DF.
pa = Pipeline(stages=stringindexer_stages + onehotencoder_stages).fit(spark_df).transform(spark_df).toPandas()

In [19]:
pa.columns

Index(['Respondent', 'MainBranch', 'Hobbyist', 'OpenSourcer', 'OpenSource',
       'Employment', 'Country', 'Student', 'EdLevel', 'UndergradMajor',
       'EduOther', 'OrgSize', 'DevType', 'YearsCode', 'Age1stCode',
       'YearsCodePro', 'CareerSat', 'JobSat', 'MgrIdiot', 'MgrMoney',
       'MgrWant', 'JobSeek', 'LastHireDate', 'LastInt', 'FizzBuzz',
       'JobFactors', 'ResumeUpdate', 'CurrencySymbol', 'CurrencyDesc',
       'CompTotal', 'CompFreq', 'ConvertedComp', 'WorkWeekHrs', 'WorkPlan',
       'WorkChallenge', 'WorkRemote', 'WorkLoc', 'ImpSyn', 'CodeRev',
       'CodeRevHrs', 'UnitTests', 'PurchaseHow', 'PurchaseWhat',
       'LanguageWorkedWith', 'LanguageDesireNextYear', 'DatabaseWorkedWith',
       'DatabaseDesireNextYear', 'PlatformWorkedWith',
       'PlatformDesireNextYear', 'WebFrameWorkedWith',
       'WebFrameDesireNextYear', 'MiscTechWorkedWith',
       'MiscTechDesireNextYear', 'DevEnviron', 'OpSys', 'Containers',
       'BlockchainOrg', 'BlockchainIs', 'BetterLife'

In [20]:
# The new columns hold SparseVector values, each containing a one-hot bitmap.
pa['onehot_OpSys'].unique()

array([SparseVector(4, {2: 1.0}), SparseVector(4, {0: 1.0}),
       SparseVector(4, {1: 1.0}), SparseVector(4, {}),
       SparseVector(4, {3: 1.0})], dtype=object)
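
To read this output: `SparseVector(4, {2: 1.0})` is a vector of length 4 whose only non-zero entry is 1.0 at position 2. The plain-Python sketch below expands the five values shown above into dense lists; `to_dense` is a hypothetical helper, not a PySpark function. The all-zero vector `SparseVector(4, {})` is presumably the category that `OneHotEncoder` drops by default (`dropLast=True`), which, with `handleInvalid="keep"` on the indexer, would be the extra bucket for missing or unseen values.

```python
def to_dense(size, active):
    """Expand a (size, {index: value}) sparse description into a dense list."""
    dense = [0.0] * size
    for index, value in active.items():
        dense[index] = value
    return dense

# The five distinct values from the output above, written as (size, active entries).
vectors = [(4, {2: 1.0}), (4, {0: 1.0}), (4, {1: 1.0}), (4, {}), (4, {3: 1.0})]

dense = [to_dense(size, active) for size, active in vectors]
for d in dense:
    print(d)
```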

In [21]:
# Combine all predictor columns into a single one (features): ASSEMBLY
extracted_columns = ['onehot_' + c for c in feature_columns]
vectorassembler_stage = VectorAssembler(inputCols=extracted_columns, outputCol='features') 

In [22]:
# Combine all data preparation steps into a single processing pipeline
final_columns = [y] + feature_columns + extracted_columns + ['features', 'label']

final_columns

['compAboveAvg',
 'OpSys',
 'EdLevel',
 'MainBranch',
 'Country',
 'Student',
 'YearsCode',
 'onehot_OpSys',
 'onehot_EdLevel',
 'onehot_MainBranch',
 'onehot_Country',
 'onehot_Student',
 'onehot_YearsCode',
 'features',
 'label']

In [23]:
transformed_df = Pipeline(stages=stringindexer_stages + \
                          onehotencoder_stages + \
                          [vectorassembler_stage]).fit(spark_df).transform(spark_df).select(final_columns)

transformed_df.limit(5).toPandas()

Unnamed: 0,compAboveAvg,OpSys,EdLevel,MainBranch,Country,Student,YearsCode,onehot_OpSys,onehot_EdLevel,onehot_MainBranch,onehot_Country,onehot_Student,onehot_YearsCode,features,label
0,False,Linux-based,"Bachelor’s degree (BA, BS, B.Eng., etc.)","I am not primarily a developer, but I write co...",Thailand,No,3,"(0.0, 0.0, 1.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(0.0, 1.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...",0.0
1,True,Windows,"Bachelor’s degree (BA, BS, B.Eng., etc.)",I am a developer by profession,United States,No,3,"(1.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...",1.0
2,True,Windows,"Bachelor’s degree (BA, BS, B.Eng., etc.)","I am not primarily a developer, but I write co...",Canada,No,13,"(1.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(0.0, 1.0)","(0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...",1.0
3,True,MacOS,Some college/university study without earning ...,I am a developer by profession,New Zealand,No,12,"(0.0, 1.0, 0.0, 0.0)","(0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, ...","(0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ...",1.0
4,False,Windows,"Master’s degree (MA, MS, M.Eng., MBA, etc.)",I am a developer by profession,India,No,12,"(1.0, 0.0, 0.0, 0.0)","(0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, ...","(1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ...",0.0


### Splitting into training and test sets

In [33]:
training, test = transformed_df.randomSplit([0.8, 0.2], seed=1234)

In [34]:
training.count()

44540
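
The split sizes are only approximately 80/20, because a random split decides for each row independently which part it lands in. The sketch below illustrates that idea in plain Python; `random_split` is a hypothetical helper written for this note and is not Spark's implementation, so it will not reproduce the exact counts above even with the same seed.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row independently to one split, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds, e.g. [0.8, 1.0] for weights [0.8, 0.2].
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point leftovers.
            if u < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

train, test = random_split(list(range(10_000)), [0.8, 0.2], seed=1234)
print(len(train), len(test))
```

Fixing the seed makes the split repeatable, which matters when comparing models on the same test set.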

### Training the model - model.fit()

In [35]:
# we start with a decision tree; no parameters have to be specified
from pyspark.ml.classification import LogisticRegression, DecisionTreeClassifier
dt = DecisionTreeClassifier(featuresCol='features', labelCol='label')

In [36]:
simple_model = Pipeline(stages=[dt]).fit(training)

In [37]:
simple_model.stages[0]

DecisionTreeClassificationModel (uid=DecisionTreeClassifier_352c14566a3a) of depth 5 with 53 nodes

### Prediction - model.transform()

In [38]:
pred_simple = simple_model.transform(test)

In [39]:
show_columns = final_columns + ['prediction', 'rawPrediction', 'probability']
pred_simple.limit(5).select(show_columns).toPandas()

Unnamed: 0,compAboveAvg,OpSys,EdLevel,MainBranch,Country,Student,YearsCode,onehot_OpSys,onehot_EdLevel,onehot_MainBranch,onehot_Country,onehot_Student,onehot_YearsCode,features,label,prediction,rawPrediction,probability
0,False,,,I am a developer by profession,Germany,No,10,"(0.0, 0.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,1.0,"[1002.0, 1464.0, 0.0]","[0.40632603406326034, 0.5936739659367397, 0.0]"
1,False,,"Bachelor’s degree (BA, BS, B.Eng., etc.)",I am a developer by profession,India,No,10,"(0.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,0.0,"[18385.0, 5162.0, 0.0]","[0.7807788677963222, 0.21922113220367775, 0.0]"
2,False,,"Bachelor’s degree (BA, BS, B.Eng., etc.)",I am a developer by profession,India,No,6,"(0.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0)","(0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,0.0,"[18385.0, 5162.0, 0.0]","[0.7807788677963222, 0.21922113220367775, 0.0]"
3,False,,"Bachelor’s degree (BA, BS, B.Eng., etc.)",I am a developer by profession,India,"Yes, full-time",4,"(0.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 1.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,0.0,"[18385.0, 5162.0, 0.0]","[0.7807788677963222, 0.21922113220367775, 0.0]"
4,False,,"Bachelor’s degree (BA, BS, B.Eng., etc.)",I am a developer by profession,United Kingdom,No,12,"(0.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, ...","(0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,1.0,"[1011.0, 1721.0, 0.0]","[0.37005856515373353, 0.6299414348462665, 0.0]"


## Evaluation

In [40]:
label_and_pred = pred_simple.select('label', 'prediction')
label_and_pred.groupBy('label', 'prediction').count().toPandas()

Unnamed: 0,label,prediction,count
0,1.0,1.0,3916
1,0.0,1.0,1023
2,1.0,0.0,1429
3,0.0,0.0,4915


In [41]:
# Evaluator
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction", metricName="areaUnderROC")

In [42]:
auroc_simple = evaluator.evaluate(pred_simple)
auroc_simple

0.6370645091262662
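
One way to interpret the area under the ROC curve: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties count as one half). The sketch below computes it that way on made-up labels and scores; `auroc` is a hypothetical helper for illustration only, and it compares all pairs, which would not scale to the full dataset.

```python
def auroc(labels, scores):
    """Pairwise estimate of the area under the ROC curve for binary labels."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))

# Made-up example: 3 of the 4 positive/negative pairs are ordered correctly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
value = auroc(labels, scores)
print(value)  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.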

In [43]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator_m = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator_m.evaluate(pred_simple)
accuracy

0.7826819108393158
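
The accuracy above can be cross-checked by hand from the confusion matrix counts shown earlier. The sketch below uses those four counts directly; precision and recall for class 1.0 are added here for illustration and are not computed in the notebook.

```python
# Counts copied from the groupBy('label', 'prediction') output above,
# keyed as (label, prediction).
counts = {
    (1.0, 1.0): 3916,  # true positives
    (0.0, 1.0): 1023,  # false positives
    (1.0, 0.0): 1429,  # false negatives
    (0.0, 0.0): 4915,  # true negatives
}

tp = counts[(1.0, 1.0)]
fp = counts[(0.0, 1.0)]
fn = counts[(1.0, 0.0)]
tn = counts[(0.0, 0.0)]
total = tp + fp + fn + tn

accuracy = (tp + tn) / total   # share of correct predictions
precision = tp / (tp + fp)     # how many predicted positives are real
recall = tp / (tp + fn)        # how many real positives were found

print(total, accuracy, precision, recall)
```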

## Adding hyperparameters

In [44]:
# Which values of the maxDepth hyperparameter should be tested
from pyspark.ml.tuning import ParamGridBuilder
param_grid = ParamGridBuilder().\
    addGrid(dt.maxDepth, [2,3,4,5,6]).\
    build()

In [45]:
# Cross-validation performed to optimize the hyperparameters
from pyspark.ml.tuning import CrossValidator
cv = CrossValidator(estimator=dt, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=4)
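
With five candidate values for `maxDepth` and `numFolds=4`, cross-validation trains 5 x 4 = 20 models before refitting the best setting on the whole training set. The sketch below counts those fits in plain Python. `k_fold_indices` is a hypothetical helper, and assigning rows to folds by striding is an assumption made to keep the example deterministic; Spark assigns rows to folds randomly.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, valid_idx) pairs, one per fold."""
    # Row i goes to fold i % k (deterministic striding, an assumption of this sketch).
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i in range(k):
        valid = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, valid

max_depths = [2, 3, 4, 5, 6]   # same grid as in param_grid above
num_folds = 4

fits = 0
for depth in max_depths:
    for train_idx, valid_idx in k_fold_indices(20, num_folds):
        # A real run would train a tree of this depth on train_idx
        # and score it on valid_idx; here we only count the fits.
        fits += 1

print(fits)  # 20
```

This is why grid search is a typical CPU-bound workload: the cost grows with the product of grid size and fold count.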

In [46]:
# Fit the model on the training data
cv_model = cv.fit(training)

In [47]:
cv_model.bestModel

DecisionTreeClassificationModel (uid=DecisionTreeClassifier_352c14566a3a) of depth 3 with 13 nodes

## Prediction with the new model

In [48]:
# What do the predictions look like on the test set?
pred_cv = cv_model.transform(test)
show_columns = final_columns + ['prediction', 'rawPrediction', 'probability']
pred_cv.limit(5).select(show_columns).toPandas()

Unnamed: 0,compAboveAvg,OpSys,EdLevel,MainBranch,Country,Student,YearsCode,onehot_OpSys,onehot_EdLevel,onehot_MainBranch,onehot_Country,onehot_Student,onehot_YearsCode,features,label,prediction,rawPrediction,probability
0,False,,,I am a developer by profession,Afghanistan,,Less than 1 year,"(0.0, 0.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,0.0,"[20089.0, 7490.0, 0.0]","[0.7284165488233801, 0.2715834511766199, 0.0]"
1,False,,,I am a developer by profession,Algeria,,13,"(0.0, 0.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,0.0,"[20089.0, 7490.0, 0.0]","[0.7284165488233801, 0.2715834511766199, 0.0]"
2,False,,,I am a developer by profession,Canada,No,5,"(0.0, 0.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0)","(0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,1.0,"[701.0, 1147.0, 0.0]","[0.37932900432900435, 0.6206709956709957, 0.0]"
3,False,,,"I am not primarily a developer, but I write co...",Spain,No,30,"(0.0, 0.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(0.0, 1.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,0.0,"[20089.0, 7490.0, 0.0]","[0.7284165488233801, 0.2715834511766199, 0.0]"
4,False,,"Bachelor’s degree (BA, BS, B.Eng., etc.)",I am a developer by profession,India,No,7,"(0.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,0.0,"[20089.0, 7490.0, 0.0]","[0.7284165488233801, 0.2715834511766199, 0.0]"


In [49]:
# Confusion matrix
label_and_pred = pred_cv.select('label', 'prediction')
label_and_pred.groupBy('label', 'prediction').count().toPandas()

Unnamed: 0,label,prediction,count
0,1.0,1.0,3428
1,0.0,1.0,806
2,1.0,0.0,1917
3,0.0,0.0,5132


In [50]:
auroc_cv = evaluator.evaluate(pred_cv)
auroc_cv

0.6738679482182741

In [51]:
acc_cv = evaluator_m.evaluate(pred_cv)
acc_cv

0.7586634760258797

## Classification with Gradient Boosted Trees

In [53]:
from pyspark.ml.classification import GBTClassifier
gbt = GBTClassifier(labelCol="label", featuresCol="features", maxIter=10)
model = gbt.fit(training)

In [54]:
evaluator.evaluate(model.transform(test))

0.8806132814259979

## Exercises:

* Can the prediction quality be improved further by:
    * a) adding features
    * b) changing the model
    * c) choosing better model parameters?

In [None]:
#R code
#library(data.table)
#srv <- fread("survey_results_public.csv")
#srv$OpSys2 <- srv$OpSys == "Windows"
#library(rpart)
#srv$CompAboveAvg <- srv$ConvertedComp > 60e3
#dt_fit = rpart(CompAboveAvg ~ Age + EdLevel + Student + OpSys + YearsCode , data = srv, method = 'class')
#pred_y = predict(dt_fit, type = 'class')
#table(predict(dt_fit, srv[,c("Age" , "EdLevel", "Student", "OpSys", "YearsCode")], type = "class"), srv$CompAboveAvg)
#srv(cor)
