# Data science - quantitative analysis

Data science can be applied to numerical data as well.
We do not need to modify numerical data to suit computer analysis, so we can read it in directly.
Here we examine the World Values Survey dataset.

[Narrated code walkthrough](https://www.youtube.com/watch?v=b-vtqbhlBaQ)

In [None]:
import pandas as pd
import numpy as np

data = pd.read_csv( './data/wvs.csv').astype(np.float64)
data['V10'].value_counts()

## Supervised machine learning

We use the general-purpose library [scikit-learn](https://scikit-learn.org/stable/) for these exercises. The workflow has three steps:

1. split the data into train and test datasets
1. fit a model to the training data
1. examine the model performance using the test dataset

In [None]:
from sklearn import tree
from sklearn.model_selection import train_test_split

In [None]:
y = data['V10'] ## the target: the variable we try to predict
x = data.drop( 'V10', axis = 1) ## data we use for predicting; the target column V10 is removed

## We simplify data used for predicting and only use V4, V5, V6, V7, V8 and V9
x = x[['V4', 'V5', 'V6', 'V7', 'V8', 'V9']]

x_train, x_test, y_train, y_test = train_test_split( x, y , test_size=0.3, random_state=42)

In [None]:
y_train.value_counts()

In [None]:
y_test.value_counts()

In [None]:
model = tree.DecisionTreeClassifier()
model = model.fit( x_train, y_train )

In [None]:
pred = model.predict( x_test )

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

print( confusion_matrix( y_test, pred ) )
print( classification_report( y_test, pred ) )
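
To help read this output, here is a small sketch with made-up labels (not survey data). In scikit-learn's `confusion_matrix`, rows correspond to the true classes and columns to the predicted classes, both in sorted label order. The names `toy_true` and `toy_pred` are hypothetical and exist only for this illustration.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration only; these are not survey answers.
toy_true = [1, 1, 2, 2, 2]
toy_pred = [1, 2, 2, 2, 1]

# Rows are true classes (1, 2); columns are predicted classes (1, 2).
cm = confusion_matrix(toy_true, toy_pred)
print(cm)
# [[1 1]
#  [1 2]]

# Accuracy is the share of correct predictions: the diagonal (1 + 2) out of 5.
acc = accuracy_score(toy_true, toy_pred)
print(acc)  # 0.6
```

The diagonal holds the correctly classified entries; everything off the diagonal is a misclassification.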

## Exercises

* `V10` has several unwanted values: `-5`, `-2` and `-1`. Remove them from the data and rerun the analysis.
* What other variables would you add to the analysis? Do they improve accuracy? See [survey documentation](https://www.worldvaluessurvey.org/WVSDocumentationWV6.jsp) for the meaning of variables.
* What methods other than `DecisionTreeClassifier` exist (see the [scikit-learn supervised learning documentation](https://scikit-learn.org/stable/supervised_learning.html))? Try them out. Do you get better results?
* What does cross validation mean? Try it out and use it to re-evaluate the models from the exercises above.
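
As a starting point for these exercises, here is a minimal sketch on randomly generated toy data rather than the survey file. It assumes the negative codes should simply be dropped as rows, and it uses 5-fold cross validation with scikit-learn's `cross_val_score`. The `toy` data frame and its values are hypothetical; adapt the idea to `data` yourself.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for the survey; all values are random.
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    'V4': rng.integers(1, 5, size=200),
    'V5': rng.integers(1, 5, size=200),
    'V10': rng.choice([-5, -2, -1, 1, 2, 3, 4], size=200),
})

# Keep only rows whose target is not one of the unwanted negative codes.
clean = toy[~toy['V10'].isin([-5, -2, -1])]

# 5-fold cross validation: the data is split into five parts, and each part
# is used once as the test set while the model is fit on the other four.
scores = cross_val_score(
    DecisionTreeClassifier(random_state=42),
    clean[['V4', 'V5']],
    clean['V10'],
    cv=5,
)
print(scores)         # one accuracy value per fold
print(scores.mean())  # average accuracy across the folds
```

Because the toy values are random, the accuracy here should be close to guessing; the point is the mechanics, not the score.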

## Unsupervised machine learning

Beyond classifying data based on existing attributes, you may sometimes want to find groups of similar entries in the data.
This is unsupervised machine learning; several methods exist for this.

In [None]:
from sklearn.cluster import KMeans

data_for_kmeans = data.iloc[:500].copy() ## keep only the first 500 rows so that clustering is faster and easier to inspect

model = KMeans(n_clusters=5, random_state=42)
clusters = model.fit_predict( data_for_kmeans )

In [None]:
## add clusters to our data

data_for_kmeans['clusters'] = clusters

data_for_kmeans.groupby('clusters').agg('mean')

## Exercises

* Which cluster has the highest number of data points?
* How does changing the number of clusters change the results?
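
One possible way to approach these questions, sketched on synthetic points from scikit-learn's `make_blobs` instead of the survey data: count the members of each cluster with `value_counts` and repeat the clustering for a few values of `n_clusters`. The candidate values 2, 4 and 6 and the `n_init=10` setting are assumptions made for this illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D points for illustration only; not the survey data.
points, _ = make_blobs(n_samples=300, centers=4, random_state=42)

cluster_sizes = {}
for k in (2, 4, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = model.fit_predict(points)

    # Number of data points assigned to each cluster label.
    sizes = pd.Series(labels).value_counts()
    cluster_sizes[k] = sizes

    # inertia_ is the sum of squared distances to the nearest cluster centre.
    print(k, 'clusters; largest is cluster', sizes.idxmax(),
          'with', sizes.max(), 'points; inertia', round(model.inertia_, 1))
```

Comparing the cluster sizes and the inertia across values of `k` gives a first impression of how sensitive the grouping is to the chosen number of clusters.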