# Social Networks - Assignment 0 [optional]

This **Home Assignment** is **optional**; it does not count towards your eligibility to take part in the exam. It familiarizes you with the basics of *statistics*, the basics of the *sklearn* package and, most importantly, the general setup for home assignments.
This first home assignment is shorter and less difficult than the upcoming ones.

You can expect numpy, pandas and sklearn to be installed.

## Formalities

**Submit in a group of 3-4 people by 28.04.2022, 23:59 CET. The deadline is strict!**

## Evaluation and Grading
General advice for programming exercises at *CSSH*:
Evaluation of your submission is done semi-automatically. Think of it as this notebook being
executed once. Afterwards, some test functions are appended to this file and executed as well.

Therefore:
* Submit valid _Python3_ code only!
* Use external libraries only when the task specifies them.
* Ensure your definitions (functions, classes, methods, variables) follow the specification if
  given. The concrete signature of e.g. a function can usually be inferred from the task description,
  code skeletons and test cases.
* Ensure the notebook does not rely on current notebook or system state!
  * Use `Kernel --> Restart & Run All` to see if you are using any definitions, variables etc. that 
    are not in scope anymore.
* Keep your code idempotent! Running it or parts of it multiple times must not yield different
  results. Minimize usage of global variables.
* Ensure your code / notebook terminates in reasonable time.
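
As a minimal sketch of the idempotency point above (the names `results`, `collect_bad` and `collect_good` are made up for illustration and are not part of the assignment): a function that appends to a global list returns something different each time it is re-run, while a function that builds its result locally does not.

```python
# Hypothetical illustration; none of these names are part of the assignment.
results = []  # global state shared between calls

def collect_bad(x):
    # Mutates the global list, so every call changes what the next call returns.
    results.append(x)
    return results

def collect_good(x):
    # Builds a fresh list on every call; repeated calls give the same answer.
    return [x]

first = list(collect_bad(1))
second = list(collect_bad(1))
print(first, second)                     # the two results differ
print(collect_good(1), collect_good(1))  # the two results are identical
```

Re-running a cell that calls `collect_bad` has the same effect as calling it twice here, which is why `Restart & Run All` is a useful check.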

**There's a story behind each of these points! Don't expect us to fix your stuff!**

Regarding the scores, you will get no points for a task if your function:
- throws an unexpected error (e.g. takes the wrong number of arguments)
- gets stuck in an infinite loop
- takes much longer than expected (e.g. >1s to compute the mean of two numbers)
- does not produce the desired output (e.g. returns a list sorted in descending order even though we asked for ascending, returns the mean and the std even though we asked for only the mean, or prints an output instead of returning it)
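
To illustrate the difference between printing and returning with a made-up example (`mean_printed` and `mean_returned` are hypothetical names, not part of the assignment): a function that only prints its result implicitly returns `None`, so a test that compares the return value fails even though the printed number looks right.

```python
def mean_printed(a, b):
    # Prints the value but returns None implicitly.
    print((a + b) / 2)

def mean_returned(a, b):
    # Returns the value so that a test can compare it.
    return (a + b) / 2

printed = mean_printed(1, 2)
returned = mean_returned(1, 2)
print(printed is None)   # True
print(returned == 1.5)   # True
```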

In [1]:
# credentials of all team members (you may add or remove members from the list)
team_members = [
    {
        'first_name': 'Felix',
        'last_name': 'Stamm',
        'student_id': 123451
    },
    {
        'first_name': 'Bob',
        'last_name': 'Bar',
        'student_id': 54321
    }
]

In [2]:
from typing import List, Union
from numbers import Number
from unittest import TestCase
some_testCase = TestCase()

# General remarks


- Even though this is an optional assignment, use this opportunity to test/train for the real home assignments.
- Python functions can read variables from the enclosing notebook scope. Be careful not to use variables from outside a function by accident instead of those defined inside it.
- You are welcome to add additional test cases to test your functions.
- Try to keep your code as deterministic as possible.
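
The scoping pitfall can be sketched as follows (all names here are hypothetical and only serve the illustration): the buggy function ignores its parameter and reads a notebook-level variable instead, and Python raises no error.

```python
values = [10, 20, 30]  # defined at notebook (global) level

def total_buggy(numbers):
    # Bug: reads the global `values` instead of the parameter `numbers`.
    # Python raises no error because globals are visible inside functions.
    return sum(values)

def total_fixed(numbers):
    # Uses only its own parameter.
    return sum(numbers)

print(total_buggy([1, 2, 3]))  # 60, silently wrong
print(total_fixed([1, 2, 3]))  # 6
```

Such a bug tends to surface only when the tests run in a fresh environment where the global variable has a different value or does not exist.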

## Task 1 (5 points total)
To refresh your knowledge of basic statistics, we are going to implement mean, mode, median and standard deviation. All these functions should leave the input argument intact. Try to come up with your own solution and do *not* just use the functions from the statistics/numpy/pandas packages.
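
One way the input can get changed by accident is in-place sorting. The following sketch (with made-up variable names) contrasts `sorted()`, which returns a new list, with `list.sort()`, which modifies the list it is called on.

```python
data = [3, 1, 2]

# sorted() returns a new list and leaves the original untouched.
ordered = sorted(data)
print(data)     # [3, 1, 2]
print(ordered)  # [1, 2, 3]

# list.sort() sorts in place and would change the caller's list,
# so work on a copy to keep `data` intact.
scratch = list(data)
scratch.sort()
print(data)     # still [3, 1, 2]
```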


### 1a) Mean (1)
Write a function ```my_mean``` that takes a list of numeric values and returns the mean. 

In [3]:
def my_mean(l : List[Number]) -> Number:
    return 0

In [5]:
print(my_mean([1,2,3]))# 2
print(my_mean([1,2,3,4]))# 2.5

2.0
2.5


### 1b) Std (1)
Write a function ```my_std``` that takes a list of numeric values and returns the standard deviation. Divide by n, not by n-1 (i.e. compute the population standard deviation).
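
For reference, writing the list as $x_1, \dots, x_n$ with mean $\bar{x}$ (notation introduced here, not taken from the task), the quantity with division by n is:

$$
\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
$$

For the list `[3, 4]`, the mean is 3.5, both squared deviations are 0.25, and so $\sigma = \sqrt{(0.25 + 0.25)/2} = 0.5$.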

In [6]:
def my_std(l : List[Number]) -> Number:
    return 0

In [8]:
print(my_std([3,4])) #0.5

0.5


### 1c) Mode (2)
Write a function ```my_mode``` that takes a list and returns the mode.
If there is no unique mode, raise a ValueError.
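
The test cell for this task checks the error with `assertRaises`. As a rough sketch of how that check behaves (`first_item` and `checker` are hypothetical names, unrelated to `my_mode`): the `with` block passes silently if the expected error is raised inside it and fails with an `AssertionError` otherwise.

```python
from unittest import TestCase

def first_item(items):
    # Hypothetical helper, unrelated to my_mode: rejects empty input.
    if len(items) == 0:
        raise ValueError("empty input")
    return items[0]

checker = TestCase()

# Passes silently because the expected ValueError is raised inside the block.
with checker.assertRaises(ValueError):
    first_item([])

# If no ValueError is raised, assertRaises itself fails with an AssertionError.
try:
    with checker.assertRaises(ValueError):
        first_item([1])
    outcome = "no failure"
except AssertionError:
    outcome = "assertRaises failed"
print(outcome)  # assertRaises failed
```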


In [9]:
def my_mode(l : List):
    return 0
    raise ValueError("No unique mode") # how to raise an error

In [13]:
print(my_mode([3,3,4])) #3
print(my_mode([2.7,2.7,4])) #2.7
with some_testCase.assertRaises(ValueError):
    print(my_mode([3,4])) #ValueError

3
2.7


### 1d) Median (1)
Write a function ```my_median``` that takes a list of numeric values and returns the median.
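
Using the usual definition, and writing $x_{(1)} \le \dots \le x_{(n)}$ for the values in sorted order (notation introduced here for illustration):

$$
\operatorname{median} =
\begin{cases}
x_{((n+1)/2)} & \text{if } n \text{ is odd} \\
\frac{1}{2}\left(x_{(n/2)} + x_{(n/2+1)}\right) & \text{if } n \text{ is even}
\end{cases}
$$

For `[1, 2, 3, 4]` we have $n = 4$, so the median is $(2 + 3)/2 = 2.5$.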

In [None]:
def my_median(l : List[Number]) -> Number:
    return 0

In [None]:
print(my_median([1,2,3])) # 2
print(my_median([1,2,3,4]))# 2.5

## Task 2 (10 points)
In this task we are going to explore basic classifiers and the sklearn package.
### 2a) Preprocessing (2)
Write a function ```preprocess```. It takes a **relative** path as a string and, optionally, a `random_state`.

The function should:

- read the credit_g dataset into a pandas dataframe. Assume that the path points to the file and that missing values are given as `'?'`
- compute the boolean target vector (True if `'class'` is `'good'`)
- remove the target column (`'class'`) from the dataframe
- convert the categorical variables to numeric ones using `pd.get_dummies`
- perform an 80/20 train/test split using `sklearn.model_selection.train_test_split` with the provided `random_state`
- return the results of the train/test split in the order in which `train_test_split` returns them
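
The individual steps can be tried out on a small made-up table before touching the real file. The sketch below is not the expected solution: the columns `duration` and `purpose`, their values, and the variable names are assumptions chosen for illustration only.

```python
import io

import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up toy data standing in for the real file; '?' marks a missing value.
csv_text = """duration,purpose,class
6,car,good
48,tv,bad
12,?,good
42,car,good
24,tv,bad
36,car,good
24,tv,good
36,car,bad
12,tv,good
30,car,bad
"""

df = pd.read_csv(io.StringIO(csv_text), na_values="?")  # '?' becomes NaN
y = df["class"] == "good"        # boolean target vector
X = df.drop(columns=["class"])   # features without the target column
X = pd.get_dummies(X)            # one indicator column per category

# test_size=0.2 gives the 80/20 split; random_state makes it reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)
print(list(X.columns))
print(X_tr.shape, X_te.shape)
```

With its default settings, `pd.get_dummies` creates no indicator column for missing values, so the row with `?` gets zeros in both `purpose_*` columns.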


In [39]:
my_path = "credit-g.csv"
# modify your path here

In [41]:
import pandas as pd
import numpy as np
from pathlib import Path

def preprocess(path : str, random_state = 1) -> (pd.DataFrame, Union[pd.Series, np.array]):
    pass

In [43]:
# example usage
X_train, X_test, y_train, y_test = preprocess(my_path)
print(X_train.head().index)
#Int64Index([382, 994, 982, 47, 521], dtype='int64')



Int64Index([382, 994, 982, 47, 521], dtype='int64')


### 2b) Train linear SVM classifier (1)
Write a function ```train_LinearSVM_classifier``` that trains a linear support vector classifier.

It takes four arguments: the training dataset, the target array, a random state, and the maximum number of iterations after which the classifier should stop training. It returns the trained classifier.
Use the linear support vector classifier from sklearn with the given `random_state` and `max_iter`.


### 2c) Train logistic regression classifier (1)
Write a function ```train_LogisticRegression_classifier``` that trains a logistic regression classifier.

It takes four arguments: the training dataset, the target array, a random state, and the maximum number of iterations after which the classifier should stop training. It returns the trained classifier.
Use the logistic regression classifier from sklearn with the given `random_state` and `max_iter`.
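
A note on reading the example outputs: in recent scikit-learn versions, printing an estimator lists only the parameters that differ from their defaults. The default `max_iter` is 1000 for `LinearSVC` and 100 for `LogisticRegression`, which would explain why `max_iter=1000` shows up for one classifier but not the other. A small sketch (variable names are made up, and no training happens here):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

svm = LinearSVC(random_state=1, max_iter=1000)
log = LogisticRegression(random_state=1, max_iter=1000)

# The printed form lists only parameters that differ from their defaults.
print(svm)
print(log)

# The value is set on both estimators, whether or not it is printed.
print(svm.get_params()["max_iter"], log.get_params()["max_iter"])
```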

In [44]:
def train_LinearSVM_classifier(X_train, y_train, random_state=1, max_iter=1000):
    pass

In [46]:
# example usage
train_LinearSVM_classifier(X_train, y_train) # LinearSVC(random_state=1)
# We just ignore the warning



LinearSVC(random_state=1)

In [47]:
def train_LogisticRegression_classifier(X_train, y_train, random_state=1, max_iter=1000):
    pass

In [49]:
# example usage
train_LogisticRegression_classifier(X_train, y_train) # LogisticRegression(max_iter=1000, random_state=1)

LogisticRegression(max_iter=1000, random_state=1)


### 2d) Evaluate the results (4)
Write a function ```get_scores``` that computes the precision, recall, accuracy and F1 scores.
It takes three arguments. The first one is a trained classifier, the second one is the test dataset to evaluate the classifier on, the third is the ground truth target vector.
The function returns a dictionary like this:

```python
{'accuracy' : accuracy,
 'recall' : recall,
 'precision' : precision,
 'F1' : F1}
 ```

Try to program the solution yourself instead of using the functions from sklearn.
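
For reference, the four scores can be written in terms of the counts of true positives ($TP$), false positives ($FP$), false negatives ($FN$) and true negatives ($TN$). This assumes that `True` (class `'good'`) is treated as the positive class, which the task does not state explicitly:

$$
\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$

As a sanity check, the expected output in the example usage is consistent with $TP = 127$, $FP = 32$, $FN = 14$ and $TN = 27$ on 200 test rows: $154/200 = 0.77$, $127/159 \approx 0.7987$, $127/141 \approx 0.9007$ and $F_1 = 254/300 \approx 0.8467$.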


In [50]:
def get_scores(clf, x_test, gt_labels) -> dict:
    accuracy = 0
    recall = 0
    precision = 0
    F1 = 0
    return {'accuracy': accuracy,
            'recall': recall,
            'precision': precision,
            'F1': F1}

In [52]:
# example usage
clf = train_LogisticRegression_classifier(X_train, y_train)
get_scores(clf, X_test, y_test)

# expected output
#{'accuracy': 0.77,
# 'recall': 0.900709219858156,
# 'precision': 0.7987421383647799,
# 'F1': 0.8466666666666667}
# Don't worry if the last 3,4,5 digits don't match

{'accuracy': 0.77,
 'recall': 0.900709219858156,
 'precision': 0.7987421383647799,
 'F1': 0.8466666666666667}


### 2e) Bringing it all together (2)
Write two functions, ```run_SVM``` and ```run_Log```, that use the above functions to train and evaluate an SVM classifier and a logistic regression classifier, respectively.
Each function

1. loads the dataset and performs a train/test split
2. trains the respective classifier using the `random_state` and `max_iter`
3. evaluates the classifier on the test split and returns the scores dictionary

Use the functions ```preprocess```, ```train_LinearSVM_classifier```, ```train_LogisticRegression_classifier``` and ```get_scores``` you defined above.

In [35]:
def run_Log(path : str, random_state=1, max_iter=1000):
    pass

def run_SVM(path : str, random_state=1, max_iter=1000):
    pass

In [53]:
print(run_Log(my_path))
# {'accuracy': 0.77, 'recall': 0.900709219858156, 'precision': 0.7987421383647799, 'F1': 0.8466666666666667}
print(run_SVM(my_path))
# {'accuracy': 0.72, 'recall': 0.9929078014184397, 'precision': 0.717948717948718, 'F1': 0.8333333333333333}
# we ignore warning

{'accuracy': 0.77, 'recall': 0.900709219858156, 'precision': 0.7987421383647799, 'F1': 0.8466666666666667}
{'accuracy': 0.72, 'recall': 0.9929078014184397, 'precision': 0.717948717948718, 'F1': 0.8333333333333333}




In [54]:
print(run_SVM(my_path, random_state=123))
# {'accuracy': 0.72, 'recall': 1.0, 'precision': 0.7157360406091371, 'F1': 0.834319526627219}
# we ignore warning

{'accuracy': 0.72, 'recall': 1.0, 'precision': 0.7157360406091371, 'F1': 0.834319526627219}


