### Example scalability problems in ML tasks:

CPU-bound: the data fits in RAM, but training takes too long, for example when many combinations of model parameters or many different models have to be evaluated.


Memory-bound: the data is too large to fit in RAM.
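
To make the CPU-bound case concrete, the sketch below counts how many model fits a grid search with cross-validation would need. The grid values and the fold count are illustrative assumptions, not settings taken from this notebook.

```python
from itertools import product

# Hypothetical hyperparameter grid (illustrative values only).
grid = {
    "maxDepth": [2, 3, 4, 5, 6],
    "maxBins": [16, 32, 64],
    "minInstancesPerNode": [1, 5, 10],
}
num_folds = 4  # assumed number of cross-validation folds

# Every combination of values is one candidate model.
combinations = list(product(*grid.values()))

# Each candidate is trained once per fold.
total_fits = len(combinations) * num_folds

print(len(combinations), "combinations,", total_fits, "fits")
```

The count grows multiplicatively with every added parameter, which is why training time can become the bottleneck even when the data itself is small.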


#### Pipeline

![](ml-Pipeline.png)

#### Pipeline Model

![](ml-PipelineModel.png)

### ML processing pipeline

* <b>DataFrame</b>: the ML API uses the DataFrame from Spark SQL as its dataset, which can hold a variety of data types. For example, a DataFrame can have separate columns for text, feature vectors, true labels, and predictions.


* <b>Transformer</b>: a Transformer is an algorithm that turns one DataFrame into another. For example, an ML model is a Transformer that turns a DataFrame with features into a DataFrame with predictions.


* <b>Estimator</b>: an Estimator is an algorithm that can be fit on a DataFrame to produce a Transformer. For example, a learning algorithm is an Estimator that trains on a DataFrame and produces a model.


* <b>Pipeline</b>: a Pipeline chains multiple Transformers and Estimators together to define an ML workflow.


* <b>Parameter</b>: all Transformers and Estimators share a common API for specifying parameters.
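
The relationship between these abstractions can be sketched in plain Python. Every class below is a hypothetical stand-in that works on lists of dicts instead of DataFrames; it only mirrors the fit/transform contract and is not how Spark implements it.

```python
# Toy stand-ins for the Spark ML abstractions; every class here is hypothetical.

class MeanCenterer:
    """Estimator: learns the mean of a column and returns a fitted model."""
    def __init__(self, input_col, output_col):
        self.input_col, self.output_col = input_col, output_col

    def fit(self, rows):
        mean = sum(r[self.input_col] for r in rows) / len(rows)
        return MeanCentererModel(self.input_col, self.output_col, mean)


class MeanCentererModel:
    """Transformer: adds a new column and leaves the input rows untouched."""
    def __init__(self, input_col, output_col, mean):
        self.input_col, self.output_col, self.mean = input_col, output_col, mean

    def transform(self, rows):
        return [{**r, self.output_col: r[self.input_col] - self.mean} for r in rows]


class ToyPipeline:
    """Fits estimators in order, feeding each stage the output of the previous one."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted first; plain transformers are used as they are.
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return ToyPipelineModel(fitted)


class ToyPipelineModel:
    """Transformer produced by ToyPipeline.fit(); applies every fitted stage."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


data = [{"x": 1.0}, {"x": 2.0}, {"x": 3.0}]
model = ToyPipeline([MeanCenterer("x", "x_centered")]).fit(data)
result = model.transform(data)
print(result)
```

Fitting a pipeline returns a pipeline model, which is itself a transformer. That is the same pattern the notebook relies on whenever it calls `.fit(...)` followed by `.transform(...)`.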

In [3]:
import os
user_name = os.environ.get('USER')
print(user_name)

agaszmurlo


In [4]:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.config('spark.driver.memory','1g') \
.config('spark.executor.memory', '2g') \
.getOrCreate()

In [5]:
gs_path = f'gs://bucket-{user_name}/survey/2020/survey_results_public.csv'

In [6]:
db_name = user_name.replace('-','_')

In [7]:
spark.sql(f'DROP DATABASE IF EXISTS {db_name} CASCADE')
spark.sql(f'CREATE DATABASE {db_name}')
spark.sql(f'USE {db_name}')

DataFrame[]

In [10]:
table_name = "survey_2020"            

In [11]:
spark.sql(f'DROP TABLE IF EXISTS {table_name}')

spark.sql(f'CREATE TABLE IF NOT EXISTS {table_name} \
          USING csv \
          OPTIONS (HEADER true, INFERSCHEMA true, NULLVALUE "NA") \
          LOCATION "{gs_path}"')

DataFrame[]

In [12]:
spark.sql(f'describe {table_name}').show(100)

+--------------------+---------+-------+
|            col_name|data_type|comment|
+--------------------+---------+-------+
|          Respondent|      int|   null|
|          MainBranch|   string|   null|
|            Hobbyist|   string|   null|
|                 Age|   double|   null|
|          Age1stCode|   string|   null|
|            CompFreq|   string|   null|
|           CompTotal|   double|   null|
|       ConvertedComp|   double|   null|
|             Country|   string|   null|
|        CurrencyDesc|   string|   null|
|      CurrencySymbol|   string|   null|
|DatabaseDesireNex...|   string|   null|
|  DatabaseWorkedWith|   string|   null|
|             DevType|   string|   null|
|             EdLevel|   string|   null|
|          Employment|   string|   null|
|           Ethnicity|   string|   null|
|              Gender|   string|   null|
|          JobFactors|   string|   null|
|              JobSat|   string|   null|
|             JobSeek|   string|   null|
|LanguageDesireN

### Preparing the data for analysis

In this exercise we want to build a classifier that predicts whether a respondent earns more than 60,000 USD per year.

In [14]:
spark_df= spark.sql(f'SELECT *, CAST((convertedComp > 60000) AS STRING) AS compAboveAvg \
                    FROM {table_name} where convertedComp IS NOT NULL ')
spark_df.limit(5).toPandas()

Unnamed: 0,Respondent,MainBranch,Hobbyist,Age,Age1stCode,CompFreq,CompTotal,ConvertedComp,Country,CurrencyDesc,...,SurveyLength,Trans,UndergradMajor,WebframeDesireNextYear,WebframeWorkedWith,WelcomeChange,WorkWeekHrs,YearsCode,YearsCodePro,compAboveAvg
0,8,I am a developer by profession,Yes,36.0,12,Yearly,116000.0,116000.0,United States,United States dollar,...,Appropriate in length,No,"Computer science, computer engineering, or sof...",Django;React.js;Vue.js,Flask,Just as welcome now as I felt last year,39.0,17,13,True
1,10,I am a developer by profession,Yes,22.0,14,Yearly,25000.0,32315.0,United Kingdom,Pound sterling,...,Appropriate in length,No,Mathematics or statistics,Flask;jQuery,Flask;jQuery,Somewhat more welcome now than last year,36.0,8,4,False
2,11,I am a developer by profession,Yes,23.0,13,Yearly,31000.0,40070.0,United Kingdom,Pound sterling,...,Appropriate in length,No,"Computer science, computer engineering, or sof...",Angular;Django;React.js,Angular;Angular.js;Django;React.js,Just as welcome now as I felt last year,40.0,10,2,False
3,12,I am a developer by profession,No,49.0,42,Monthly,1100.0,14268.0,Spain,European Euro,...,Appropriate in length,No,Mathematics or statistics,ASP.NET;jQuery,ASP.NET;jQuery,Just as welcome now as I felt last year,40.0,7,7,False
4,13,"I am not primarily a developer, but I write co...",Yes,53.0,14,Monthly,3000.0,38916.0,Netherlands,European Euro,...,Too long,No,,,,A lot less welcome now than last year,36.0,35,20,False


The goal is to prepare a single feature vector and a single label column.

First step, feature extraction: we encode the text columns as numeric indices, then encode those indices as one-hot vectors. Finally, we assemble everything into a single vector.
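
As a rough illustration of the first encoding step, the plain-Python sketch below assigns indices by descending frequency and reserves one extra index for missing or unseen values, which is the effect the notebook gets from `setHandleInvalid("keep")`. The function names and the toy data are hypothetical, and the alphabetical tie-break is an assumption of this sketch.

```python
from collections import Counter

def fit_string_index(values):
    """Map each non-null string to an index, most frequent value first."""
    counts = Counter(v for v in values if v is not None)
    # Sort by descending frequency, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

def apply_string_index(mapping, values):
    """Unknown or missing values get one extra index, len(mapping)."""
    invalid_index = float(len(mapping))
    return [mapping.get(v, invalid_index) for v in values]

op_sys = ["Windows", "Linux-based", "Windows", "MacOS", "Windows", "MacOS", None]
mapping = fit_string_index(op_sys)
indexed = apply_string_index(mapping, op_sys)
print(mapping)
print(indexed)
```

The indices carry no numeric meaning of their own, which is why the next step turns them into one-hot vectors before they reach the model.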

In [27]:
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml import Pipeline
# we want to predict compAboveAvg
y = 'compAboveAvg'
# based on:
feature_columns = ['OpSys', 'EdLevel', 'MainBranch' , 'Country', 'JobSeek', 'YearsCode']

In [28]:
# We start with StringIndexer, which maps string values to numeric indices

# first a plain for loop, then we refactor it into a list comprehension

# for the features that will be used for prediction

stringindexer_stages_1 = []
for c in feature_columns:
    stringindexer_stages_1.append(StringIndexer(inputCol=c, outputCol='strindexed_' + c).setHandleInvalid("keep"))


# and for the target variable
stringindexer_stages_1.append(StringIndexer(inputCol=y, outputCol='label').setHandleInvalid("keep"))


In [29]:
# Refactoring into a list comprehension

stringindexer_stages = [StringIndexer(inputCol=c, outputCol='strindexed_' + c).setHandleInvalid("keep") for c in feature_columns]

# and for the target variable
stringindexer_stages += [StringIndexer(inputCol=y, outputCol='label').setHandleInvalid("keep")]
stringindexer_stages

[StringIndexer_92ab85a737ab,
 StringIndexer_e7a217c72722,
 StringIndexer_9b721b83ebf9,
 StringIndexer_ebf15a33bbc7,
 StringIndexer_ed57b835d570,
 StringIndexer_97390462699b,
 StringIndexer_3a27c85e019a]

In [30]:
# After this transformation the DataFrame gains 7 new columns: six with the "strindexed_" prefix plus "label"
Pipeline(stages=stringindexer_stages).fit(spark_df).transform(spark_df).toPandas()

Unnamed: 0,Respondent,MainBranch,Hobbyist,Age,Age1stCode,CompFreq,CompTotal,ConvertedComp,Country,CurrencyDesc,...,YearsCode,YearsCodePro,compAboveAvg,strindexed_OpSys,strindexed_EdLevel,strindexed_MainBranch,strindexed_Country,strindexed_JobSeek,strindexed_YearsCode,label
0,8,I am a developer by profession,Yes,36.0,12,Yearly,116000.0,116000.0,United States,United States dollar,...,17,13,true,2.0,0.0,0.0,0.0,0.0,17.0,1.0
1,10,I am a developer by profession,Yes,22.0,14,Yearly,25000.0,32315.0,United Kingdom,Pound sterling,...,8,4,false,0.0,1.0,0.0,2.0,0.0,1.0,0.0
2,11,I am a developer by profession,Yes,23.0,13,Yearly,31000.0,40070.0,United Kingdom,Pound sterling,...,10,2,false,0.0,0.0,0.0,2.0,2.0,0.0,0.0
3,12,I am a developer by profession,No,49.0,42,Monthly,1100.0,14268.0,Spain,European Euro,...,7,7,false,0.0,2.0,0.0,10.0,0.0,3.0,0.0
4,13,"I am not primarily a developer, but I write co...",Yes,53.0,14,Monthly,3000.0,38916.0,Netherlands,European Euro,...,35,20,false,1.0,3.0,1.0,7.0,1.0,24.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34751,65619,"I am not primarily a developer, but I write co...",Yes,,19,Monthly,30000.0,984.0,Nigeria,Nigerian naira,...,3,2,false,0.0,3.0,1.0,38.0,3.0,13.0,0.0
34752,65625,I am a developer by profession,Yes,,17,Monthly,5500000.0,19428.0,Colombia,Colombian peso,...,12,5,false,4.0,0.0,0.0,39.0,1.0,7.0,0.0
34753,65629,I am a developer by profession,Yes,41.0,15,Yearly,200.0,200.0,United States,United States dollar,...,25,20,false,1.0,2.0,0.0,0.0,0.0,14.0,0.0
34754,65630,I am a developer by profession,Yes,,17,Monthly,1000000.0,15048.0,Chile,Chilean peso,...,7,3,false,4.0,0.0,0.0,48.0,0.0,3.0,0.0


In [31]:
onehotencoder_stages = [OneHotEncoder(inputCol='strindexed_' + c, outputCol='onehot_' + c) for c in feature_columns]

In [32]:
# We extend the pipeline..
#After this transformation 6 new columns with the "onehot_" prefix are added to the DataFrame. SparseV
pa = Pipeline(stages=stringindexer_stages + onehotencoder_stages).fit(spark_df).transform(spark_df).toPandas()

In [33]:
pa.columns

Index(['Respondent', 'MainBranch', 'Hobbyist', 'Age', 'Age1stCode', 'CompFreq',
       'CompTotal', 'ConvertedComp', 'Country', 'CurrencyDesc',
       'CurrencySymbol', 'DatabaseDesireNextYear', 'DatabaseWorkedWith',
       'DevType', 'EdLevel', 'Employment', 'Ethnicity', 'Gender', 'JobFactors',
       'JobSat', 'JobSeek', 'LanguageDesireNextYear', 'LanguageWorkedWith',
       'MiscTechDesireNextYear', 'MiscTechWorkedWith',
       'NEWCollabToolsDesireNextYear', 'NEWCollabToolsWorkedWith', 'NEWDevOps',
       'NEWDevOpsImpt', 'NEWEdImpt', 'NEWJobHunt', 'NEWJobHuntResearch',
       'NEWLearn', 'NEWOffTopic', 'NEWOnboardGood', 'NEWOtherComms',
       'NEWOvertime', 'NEWPurchaseResearch', 'NEWPurpleLink', 'NEWSOSites',
       'NEWStuck', 'OpSys', 'OrgSize', 'PlatformDesireNextYear',
       'PlatformWorkedWith', 'PurchaseWhat', 'Sexuality', 'SOAccount',
       'SOComm', 'SOPartFreq', 'SOVisitFreq', 'SurveyEase', 'SurveyLength',
       'Trans', 'UndergradMajor', 'WebframeDesireNextYear',
  

In [34]:
# The new columns hold SparseVector values that encode a one-hot bitmap.
pa['onehot_OpSys'].unique()

array([SparseVector(4, {2: 1.0}), SparseVector(4, {0: 1.0}),
       SparseVector(4, {1: 1.0}), SparseVector(4, {}),
       SparseVector(4, {3: 1.0})], dtype=object)
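
The empty vector `SparseVector(4, {})` in the output above is consistent with the last category being dropped during one-hot encoding. The sketch below assumes that OpSys has four real labels plus one extra bucket for invalid values, and that the encoder drops the last category. Both are assumptions inferred from the printed output, and the helper name is hypothetical.

```python
def one_hot_drop_last(index, num_categories):
    """Return (size, {position: 1.0}) with the last category dropped."""
    size = num_categories - 1
    active = {int(index): 1.0} if index < size else {}
    return size, active

# Assumed: 4 real operating-system labels plus 1 bucket for invalid values.
num_categories = 5
encoded = [one_hot_drop_last(i, num_categories) for i in [2.0, 0.0, 1.0, 4.0, 3.0]]
print(encoded)
```

Under these assumptions a row with a missing OpSys ends up as the all-zero vector, so it stays distinguishable from every real category without needing a column of its own.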

In [35]:
# Assembly: combine all predictor columns into a single one (features)
extracted_columns = ['onehot_' + c for c in feature_columns]
vectorassembler_stage = VectorAssembler(inputCols=extracted_columns, outputCol='features') 
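
Conceptually, assembling means concatenating the per-column vectors into one long vector. A minimal plain-Python sketch of that idea follows; the `(size, {position: value})` representation and the `assemble` helper are hypothetical and only stand in for sparse vectors.

```python
def assemble(vectors):
    """Concatenate sparse vectors given as (size, {position: value}) pairs."""
    total_size, merged = 0, {}
    for size, active in vectors:
        for position, value in active.items():
            merged[total_size + position] = value  # shift by the running offset
        total_size += size
    return total_size, merged

# Toy inputs: sizes 4, 9 and 2 mirror onehot_OpSys, onehot_EdLevel, onehot_MainBranch.
row = [(4, {2: 1.0}), (9, {0: 1.0}), (2, {0: 1.0})]
features = assemble(row)
print(features)
```

The length of the assembled vector is the sum of the input vector lengths, so high-cardinality columns such as Country contribute most of the feature dimensions.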

In [36]:
# Combine all data preparation steps into a single processing pipeline
final_columns = [y] + feature_columns + extracted_columns + ['features', 'label']

final_columns

['compAboveAvg',
 'OpSys',
 'EdLevel',
 'MainBranch',
 'Country',
 'JobSeek',
 'YearsCode',
 'onehot_OpSys',
 'onehot_EdLevel',
 'onehot_MainBranch',
 'onehot_Country',
 'onehot_JobSeek',
 'onehot_YearsCode',
 'features',
 'label']

In [37]:
transformed_df = Pipeline(stages=stringindexer_stages + \
                          onehotencoder_stages + \
                          [vectorassembler_stage]).fit(spark_df).transform(spark_df).select(final_columns)

transformed_df.limit(5).toPandas()

Unnamed: 0,compAboveAvg,OpSys,EdLevel,MainBranch,Country,JobSeek,YearsCode,onehot_OpSys,onehot_EdLevel,onehot_MainBranch,onehot_Country,onehot_JobSeek,onehot_YearsCode,features,label
0,True,Linux-based,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",I am a developer by profession,United States,"I’m not actively looking, but I am open to new...",17,"(0.0, 0.0, 1.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...",1.0
1,False,Windows,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",I am a developer by profession,United Kingdom,"I’m not actively looking, but I am open to new...",8,"(1.0, 0.0, 0.0, 0.0)","(0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0)","(0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ...",0.0
2,False,Windows,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",I am a developer by profession,United Kingdom,I am actively looking for a job,10,"(1.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 1.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...",0.0
3,False,Windows,Some college/university study without earning ...,I am a developer by profession,Spain,"I’m not actively looking, but I am open to new...",7,"(1.0, 0.0, 0.0, 0.0)","(0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ...",0.0
4,False,MacOS,"Secondary school (e.g. American high school, G...","I am not primarily a developer, but I write co...",Netherlands,I am not interested in new job opportunities,35,"(0.0, 1.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(0.0, 1.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ...","(0.0, 1.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ...",0.0


### Train/test split

In [38]:
training, test = transformed_df.randomSplit([0.8, 0.2], seed=1234)
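
A random split assigns rows to the two sets probabilistically, so the resulting sizes are only approximately 80% and 20%. The sketch below illustrates the idea in plain Python; it is not Spark's implementation, and `random_split` is a hypothetical helper.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to a split by drawing one uniform number per row."""
    total = sum(weights)
    # Cumulative boundaries, e.g. [0.8, 1.0] for weights [0.8, 0.2].
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point rounding leftovers.
            if u < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

train_rows, test_rows = random_split(list(range(10_000)), [0.8, 0.2], seed=1234)
print(len(train_rows), len(test_rows))
```

Fixing the seed makes the split reproducible, which matters when results from different models are compared on the same test set.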

In [39]:
training.count()

27781

### Training the model - model.fit()

In [40]:
# we start with a decision tree; no parameters need to be specified
from pyspark.ml.classification import LogisticRegression, DecisionTreeClassifier
dt = DecisionTreeClassifier(featuresCol='features', labelCol='label')

In [41]:
simple_model = Pipeline(stages=[dt]).fit(training)

In [42]:
simple_model.stages[0]

DecisionTreeClassificationModel: uid=DecisionTreeClassifier_702a70e7c1fb, depth=5, numNodes=41, numClasses=3, numFeatures=229

### Prediction - model.transform()

In [43]:
pred_simple = simple_model.transform(test)

In [44]:
show_columns = final_columns + ['prediction', 'rawPrediction', 'probability']
pred_simple.limit(5).select(show_columns).toPandas()

Unnamed: 0,compAboveAvg,OpSys,EdLevel,MainBranch,Country,JobSeek,YearsCode,onehot_OpSys,onehot_EdLevel,onehot_MainBranch,onehot_Country,onehot_JobSeek,onehot_YearsCode,features,label,prediction,rawPrediction,probability
0,False,,,I am a developer by profession,India,,8,"(0.0, 0.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0)","(0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,0.0,"[12420.0, 3229.0, 0.0]","[0.7936609368010735, 0.20633906319892645, 0.0]"
1,False,,,I am a developer by profession,India,I am actively looking for a job,6,"(0.0, 0.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 1.0)","(0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,0.0,"[12420.0, 3229.0, 0.0]","[0.7936609368010735, 0.20633906319892645, 0.0]"
2,False,,,I am a developer by profession,India,"I’m not actively looking, but I am open to new...",3,"(0.0, 0.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,0.0,"[12420.0, 3229.0, 0.0]","[0.7936609368010735, 0.20633906319892645, 0.0]"
3,False,,,I am a developer by profession,Nepal,"I’m not actively looking, but I am open to new...",5,"(0.0, 0.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,0.0,"[12420.0, 3229.0, 0.0]","[0.7936609368010735, 0.20633906319892645, 0.0]"
4,False,,,I am a developer by profession,Sri Lanka,,4,"(0.0, 0.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,0.0,"[12420.0, 3229.0, 0.0]","[0.7936609368010735, 0.20633906319892645, 0.0]"


## Evaluation

In [45]:
label_and_pred = pred_simple.select('label', 'prediction')
label_and_pred.groupBy('label', 'prediction').count().toPandas()

Unnamed: 0,label,prediction,count
0,1.0,1.0,2252
1,0.0,1.0,665
2,1.0,0.0,878
3,0.0,0.0,3180


In [46]:
# Evaluator
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction", metricName="areaUnderROC")

In [47]:
auroc_simple = evaluator.evaluate(pred_simple)
auroc_simple

0.5710002201938538
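
The `rawPrediction` values printed earlier for the decision tree (for example `[12420.0, 3229.0, 0.0]`) look like per-class row counts in a leaf rather than normalized scores. Assuming the evaluator ranks rows by the second entry of that vector, a large impure leaf could outrank a small, nearly pure one, which may partly explain the low area under ROC. The sketch below uses made-up leaf counts and a hypothetical `weighted_auc` helper to show the effect.

```python
def weighted_auc(groups):
    """AUC from groups of (score, n_positive, n_negative); ties count as half."""
    wins = 0.0
    for score_p, n_pos, _ in groups:
        for score_n, _, n_neg in groups:
            if score_p > score_n:
                wins += n_pos * n_neg
            elif score_p == score_n:
                wins += 0.5 * n_pos * n_neg
    total_pos = sum(g[1] for g in groups)
    total_neg = sum(g[2] for g in groups)
    return wins / (total_pos * total_neg)

# Two hypothetical leaves: (negatives, positives) that reached each leaf.
leaves = [(100, 50), (2, 8)]

# Score = raw count of positives in the leaf.
by_raw_count = [(pos, pos, neg) for neg, pos in leaves]
# Score = fraction of positives in the leaf.
by_probability = [(pos / (neg + pos), pos, neg) for neg, pos in leaves]

auc_raw = weighted_auc(by_raw_count)
auc_prob = weighted_auc(by_probability)
print(auc_raw, auc_prob)
```

If this is what happens here, one option worth trying is to point the evaluator at the `probability` column instead of `rawPrediction` and compare the two values.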

In [48]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator_m = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator_m.evaluate(pred_simple)
accuracy

0.7787813620071684
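
The accuracy can be cross-checked by hand against the confusion-matrix counts printed above. Precision and recall for the positive class are added for context; the variable names are chosen only for this sketch.

```python
# Counts copied from the groupBy('label', 'prediction') output above.
tp = 2252  # label 1.0, prediction 1.0
fp = 665   # label 0.0, prediction 1.0
fn = 878   # label 1.0, prediction 0.0
tn = 3180  # label 0.0, prediction 0.0

total = tp + fp + fn + tn
accuracy_check = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"accuracy={accuracy_check:.4f} precision={precision:.4f} recall={recall:.4f}")
```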

## Adding hyperparameters

In [49]:
# Which values of the maxDepth hyperparameter should be tested
from pyspark.ml.tuning import ParamGridBuilder
param_grid = ParamGridBuilder().\
    addGrid(dt.maxDepth, [2,3,4,5,6]).\
    build()

In [50]:
# Cross-validation used to optimize the hyperparameters
from pyspark.ml.tuning import CrossValidator
cv = CrossValidator(estimator=dt, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=4)

In [51]:
# Fit the model on the training data
cv_model = cv.fit(training)
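
To give a feel for the work hidden behind `cv.fit(training)`, the sketch below enumerates the fold and parameter combinations for the same grid and fold count as above. The modulo-based fold assignment and the `k_fold_indices` helper are simplifications assumed for this sketch; a real implementation would typically assign rows to folds at random.

```python
def k_fold_indices(n_rows, num_folds):
    """Yield (train_idx, validation_idx) pairs; each row validates exactly once."""
    for fold in range(num_folds):
        validation = [i for i in range(n_rows) if i % num_folds == fold]
        train = [i for i in range(n_rows) if i % num_folds != fold]
        yield train, validation

max_depth_values = [2, 3, 4, 5, 6]   # same grid as param_grid above
num_folds = 4                        # same as numFolds above

fits = 0
for depth in max_depth_values:
    for train_idx, validation_idx in k_fold_indices(12, num_folds):
        fits += 1  # one model would be trained on train_idx, scored on validation_idx

print(fits)  # 5 values x 4 folds
```

After the search, the best parameter setting is typically refit once more on the full training set, so the total cost is slightly higher than the count printed here.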

In [52]:
cv_model.bestModel

DecisionTreeClassificationModel: uid=DecisionTreeClassifier_702a70e7c1fb, depth=2, numNodes=5, numClasses=3, numFeatures=229

## Prediction with the new model

In [53]:
# What do the predictions look like on the test set?
pred_cv = cv_model.transform(test)
show_columns = final_columns + ['prediction', 'rawPrediction', 'probability']
pred_cv.limit(5).select(show_columns).toPandas()

Unnamed: 0,compAboveAvg,OpSys,EdLevel,MainBranch,Country,JobSeek,YearsCode,onehot_OpSys,onehot_EdLevel,onehot_MainBranch,onehot_Country,onehot_JobSeek,onehot_YearsCode,features,label,prediction,rawPrediction,probability
0,False,,,I am a developer by profession,Germany,"I’m not actively looking, but I am open to new...",16,"(0.0, 0.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,0.0,"[13869.0, 5323.0, 0.0]","[0.7226448520216757, 0.2773551479783243, 0.0]"
1,False,,"Associate degree (A.A., A.S., etc.)",I am a developer by profession,Spain,"I’m not actively looking, but I am open to new...",6,"(0.0, 0.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0)","(0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, ...",0.0,0.0,"[13869.0, 5323.0, 0.0]","[0.7226448520216757, 0.2773551479783243, 0.0]"
2,False,,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",I am a developer by profession,India,"I’m not actively looking, but I am open to new...",3,"(0.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,0.0,"[13869.0, 5323.0, 0.0]","[0.7226448520216757, 0.2773551479783243, 0.0]"
3,False,,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",I am a developer by profession,India,"I’m not actively looking, but I am open to new...",8,"(0.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0)","(0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,0.0,"[13869.0, 5323.0, 0.0]","[0.7226448520216757, 0.2773551479783243, 0.0]"
4,False,,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",I am a developer by profession,Philippines,"I’m not actively looking, but I am open to new...",6,"(0.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0)","(0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,0.0,"[13869.0, 5323.0, 0.0]","[0.7226448520216757, 0.2773551479783243, 0.0]"


In [54]:
# Confusion matrix
label_and_pred = pred_cv.select('label', 'prediction')
label_and_pred.groupBy('label', 'prediction').count().toPandas()

Unnamed: 0,label,prediction,count
0,1.0,1.0,1765
1,0.0,1.0,386
2,1.0,0.0,1365
3,0.0,0.0,3459


In [55]:
auroc_cv = evaluator.evaluate(pred_cv)
auroc_cv

0.6798655155652127

In [56]:
acc_cv = evaluator_m.evaluate(pred_cv)
acc_cv

0.7489605734767025

## Classification with Gradient Boosted Trees

In [57]:
from pyspark.ml.classification import GBTClassifier
gbt = GBTClassifier(labelCol="label", featuresCol="features", maxIter=10)
model = gbt.fit(training)

In [58]:
evaluator.evaluate(model.transform(test))

0.8791516304731669

## Exercises:

* Can the prediction quality be improved further by:
    * a) adding features
    * b) changing the model
    * c) choosing better model parameters?

In [None]:
#R code
#library(data.table)
#srv <- fread("survey_results_public.csv")
#srv$OpSys2 <- srv$OpSys == "Windows"
#library(rpart)
#srv$CompAboveAvg <- srv$ConvertedComp > 60e3
#dt_fit = rpart(CompAboveAvg ~ Age + EdLevel + JobSeek + OpSys + YearsCode , data = srv, method = 'class')
#pred_y = predict(dt_fit, type = 'class')
#table(predict(dt_fit, srv[,c("Age" , "EdLevel", "JobSeek", "OpSys", "YearsCode")], type = "class"), srv$CompAboveAvg)
#srv(cor)
