<a href="https://colab.research.google.com/github/Vastra-Gotalandsregionen/AI_utbildning/blob/main/dag2/ML_%C3%B6vning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---
# **Machine Learning with Python**
### **Goal: Build a classification model to predict breast cancer**

### **Code & data:**
- URL to the original notebook: **tinyurl.com/2v2u9y4p** 
  - Upload it and run it in your own Google Colab account
- We will load an open dataset with data from breast cancer patients
  - The dataset has already been cleaned, transformed, and processed, so we can use it as is
  - <font color="red">**NOTE: Never upload confidential data (e.g. patient data) to Google Colab!**</font>

### **Good to know:**
- From **Exercise 1** onward, `...` appears here and there in the code; that is where you fill in code yourselves
  - Read the exercise instructions to work out what belongs there
- Run a specific code cell by pressing the play button in the top-left corner of the cell
  - Remember that the order in which you run the cells matters: if cell B needs code from cell A, you must run A before B
  - If code in cell B overwrites variables from cell A, that will affect what happens when you run A again



In [None]:
"""
First we import the packages we need:
- We use pandas to handle tabular data
- We use sklearn for everything ML-related
- Note that pandas has the alias pd, which means we access its functions with the prefix pd.
"""
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split

pd.options.display.max_columns = None  # Makes pandas show all columns instead of collapsing (hiding) some when a table is printed


"""
Then we load the data with datasets.load_breast_cancer():
- X contains all variables that we want to use as input for prediction
- y contains the variable that is our model's "target", i.e. what the model predicts
"""
X, y = datasets.load_breast_cancer(return_X_y=True, as_frame=True)  # Loads the X and y data
target_names = datasets.load_breast_cancer().target_names           # Label names, looked up once: index 0 = "malignant", index 1 = "benign"
y = y.apply(lambda code: target_names[code])                        # Replaces the numeric codes in y with their labels


"""
Finally, we split the data into training and test data:
- X_train and y_train are the data used to train the model
- X_test and y_test are the data used to evaluate the model
"""
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.6, random_state=1234)


---
# Exercise 1 - Get to know the dataset

The data is loaded as pandas objects (X as a DataFrame, y as a Series), which means it is structured as tables with columns and rows
- pandas (alias pd) has built-in functions that can be used to work with the datasets
- For example, if you write `X_train.head()` you will see the first five rows of the X_train dataset

**Tasks**:
1. Write `y_train.value_counts()` to count how many cases had the outcome "benign" and "malignant" respectively
2. Write `X_train.head()` or `X_train.tail()` to show the first/last five rows
  - If you want to show e.g. 10 rows, you can write `X_train.head(10)`
3. Use `.describe()` on `X_train` to get a summary of the dataset
4. Use `.corr()` on `X_train` to produce a correlation matrix


In [None]:
# 1.1. .value_counts()
y_train.value_counts()


benign       213
malignant    128
Name: target, dtype: int64

In [None]:
# 1.2 .head() or .tail()
X_train.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
322,12.86,13.32,82.82,504.8,0.1134,0.08834,0.038,0.034,0.1543,0.06476,0.2212,1.042,1.614,16.57,0.00591,0.02016,0.01902,0.01011,0.01202,0.003107,14.04,21.08,92.8,599.5,0.1547,0.2231,0.1791,0.1155,0.2382,0.08553
93,13.45,18.3,86.6,555.1,0.1022,0.08165,0.03974,0.0278,0.1638,0.0571,0.295,1.373,2.099,25.22,0.005884,0.01491,0.01872,0.009366,0.01884,0.001817,15.1,25.94,97.59,699.4,0.1339,0.1751,0.1381,0.07911,0.2678,0.06603
324,12.2,15.21,78.01,457.9,0.08673,0.06545,0.01994,0.01692,0.1638,0.06129,0.2575,0.8073,1.959,19.01,0.005403,0.01418,0.01051,0.005142,0.01333,0.002065,13.75,21.38,91.11,583.1,0.1256,0.1928,0.1167,0.05556,0.2661,0.07961
407,12.85,21.37,82.63,514.5,0.07551,0.08316,0.06126,0.01867,0.158,0.06114,0.4993,1.798,2.552,41.24,0.006011,0.0448,0.05175,0.01341,0.02669,0.007731,14.4,27.01,91.63,645.8,0.09402,0.1936,0.1838,0.05601,0.2488,0.08151
404,12.34,14.95,78.29,469.1,0.08682,0.04571,0.02109,0.02054,0.1571,0.05708,0.3833,0.9078,2.602,30.15,0.007702,0.008491,0.01307,0.0103,0.0297,0.001432,13.18,16.85,84.11,533.1,0.1048,0.06744,0.04921,0.04793,0.2298,0.05974


In [None]:
# 1.3 .describe()
X_train.describe()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
count,341.0,341.0,341.0,341.0,341.0,341.0,341.0,341.0,341.0,341.0,341.0,341.0,341.0,341.0,341.0,341.0,341.0,341.0,341.0,341.0,341.0,341.0,341.0,341.0,341.0,341.0,341.0,341.0,341.0,341.0
mean,14.235378,19.364633,92.727478,666.362757,0.096371,0.105316,0.091355,0.04983,0.182095,0.062733,0.405384,1.230771,2.896327,40.487528,0.007018,0.025913,0.032203,0.011841,0.020386,0.003769,16.40271,25.927654,108.402581,895.107918,0.133289,0.264462,0.287925,0.117601,0.292818,0.084867
std,3.610571,4.358095,24.815378,362.403166,0.014282,0.052911,0.079349,0.0392,0.026128,0.007186,0.275317,0.55663,2.065895,43.432908,0.002821,0.018012,0.027167,0.005987,0.007916,0.002257,4.909726,6.317725,34.253667,570.480168,0.023967,0.171489,0.22455,0.067166,0.061439,0.019981
min,6.981,10.72,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,0.04996,0.1115,0.3602,0.757,7.228,0.002866,0.002252,0.0,0.0,0.007882,0.000895,7.93,12.49,50.41,185.2,0.08125,0.03432,0.0,0.0,0.1565,0.05504
25%,11.85,15.94,76.37,432.2,0.08605,0.06669,0.02966,0.02068,0.1633,0.05766,0.2315,0.8282,1.648,17.74,0.005233,0.01308,0.0151,0.008,0.01527,0.002248,13.14,21.08,85.08,525.1,0.1178,0.1506,0.1167,0.06498,0.2552,0.07127
50%,13.3,18.9,86.49,546.1,0.09495,0.09486,0.06505,0.0339,0.1801,0.06127,0.3152,1.142,2.287,23.92,0.006428,0.02083,0.02703,0.0111,0.01857,0.003211,14.84,25.73,97.65,675.2,0.1312,0.217,0.2413,0.1015,0.2844,0.08004
75%,16.07,21.97,105.8,800.0,0.1054,0.1318,0.1425,0.07944,0.1964,0.06612,0.4768,1.475,3.312,45.42,0.008074,0.03318,0.04266,0.01471,0.02292,0.004572,19.38,30.36,128.8,1165.0,0.1467,0.3458,0.4,0.1708,0.32,0.0927
max,28.11,33.81,188.5,2499.0,0.1425,0.3114,0.4264,0.1913,0.2743,0.09744,2.873,3.896,21.98,525.6,0.02177,0.1354,0.3038,0.0409,0.06146,0.01792,33.13,49.54,229.3,3234.0,0.2226,1.058,1.252,0.291,0.6638,0.2075


In [None]:
# 1.4 .corr()
X_train.corr()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
mean radius,1.0,0.318065,0.997855,0.988596,0.137013,0.472247,0.662236,0.817763,0.088062,-0.350565,0.691811,-0.100366,0.680722,0.756758,-0.290499,0.17122,0.174191,0.381039,-0.152753,-0.048092,0.971249,0.277154,0.963783,0.951661,0.091022,0.354292,0.462018,0.722969,0.105048,-0.01877
mean texture,0.318065,1.0,0.328624,0.309854,0.011727,0.295442,0.335239,0.318563,0.104585,-0.016666,0.268643,0.400117,0.280757,0.252178,-0.048762,0.202503,0.158066,0.172315,-0.015431,0.102556,0.349523,0.91451,0.36516,0.337699,0.11422,0.308837,0.319502,0.323214,0.141075,0.173152
mean perimeter,0.997855,0.328624,1.0,0.987302,0.173369,0.524562,0.702922,0.846701,0.121518,-0.301072,0.703397,-0.089729,0.69892,0.764624,-0.272855,0.216268,0.209722,0.411337,-0.138363,-0.007353,0.972806,0.289576,0.970886,0.953243,0.125051,0.399429,0.501617,0.752394,0.129475,0.027128
mean area,0.988596,0.309854,0.987302,1.0,0.143126,0.465451,0.670788,0.820867,0.086596,-0.322561,0.739196,-0.075302,0.726549,0.809935,-0.229935,0.17449,0.187747,0.375742,-0.106373,-0.027833,0.962478,0.261569,0.955074,0.962963,0.089021,0.331628,0.446245,0.700941,0.083432,-0.024583
mean smoothness,0.137013,0.011727,0.173369,0.143126,1.0,0.638582,0.519858,0.53627,0.539908,0.57636,0.266098,0.104885,0.255894,0.223573,0.344439,0.292884,0.235341,0.343612,0.148885,0.305478,0.195199,0.100626,0.217891,0.191312,0.830714,0.465262,0.435163,0.495812,0.390188,0.482962
mean compactness,0.472247,0.295442,0.524562,0.465451,0.638582,1.0,0.89129,0.814191,0.56287,0.561702,0.465928,0.067804,0.519884,0.434879,0.101043,0.741208,0.599894,0.626847,0.13066,0.567721,0.524939,0.331925,0.582263,0.503792,0.607793,0.876252,0.827172,0.826384,0.501116,0.711206
mean concavity,0.662236,0.335239,0.702922,0.670788,0.519858,0.89129,1.0,0.918344,0.458234,0.322501,0.610891,0.054373,0.64228,0.613453,0.026237,0.64617,0.677787,0.647635,0.073408,0.448151,0.693672,0.347277,0.735121,0.685861,0.477916,0.760652,0.876049,0.87119,0.393349,0.532767
mean concave points,0.817763,0.318563,0.846701,0.820867,0.53627,0.814191,0.918344,1.0,0.409421,0.13401,0.69531,0.024154,0.702733,0.704385,-0.0283,0.456333,0.421042,0.602653,-0.000449,0.269241,0.839263,0.325068,0.862298,0.82886,0.452943,0.63033,0.711271,0.901742,0.324536,0.349537
mean symmetry,0.088062,0.104585,0.121518,0.086596,0.539908,0.56287,0.458234,0.409421,1.0,0.47902,0.252116,0.128033,0.258228,0.170415,0.208462,0.406693,0.316997,0.369111,0.407175,0.349772,0.14438,0.138843,0.176052,0.13308,0.457401,0.455002,0.417357,0.399411,0.675476,0.429053
mean fractal dimension,-0.350565,-0.016666,-0.301072,-0.322561,0.57636,0.561702,0.322501,0.13401,0.47902,1.0,-0.053444,0.177765,-0.003385,-0.131564,0.423969,0.548635,0.434323,0.280815,0.322449,0.720676,-0.274989,0.042689,-0.225738,-0.257267,0.548285,0.492138,0.378034,0.181234,0.378115,0.779172


--- 
# Exercise 2 - Standardizing the X data

We use `StandardScaler()` to standardize our numeric X data (we do not standardize the y data because it is categorical). 

Standardization makes it easier to interpret relationships and compare variables with each other. Standardized data has the following properties (for the data the scaler was fitted on):
- mean=0
- standard deviation=1

Note that there are different types of scaling for different situations. Neural networks, for example, often use `MinMaxScaler` instead of `StandardScaler`; `MinMaxScaler` normalizes the data so that all data points fall within the range 0 to 1.
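
As a minimal sketch of what the two kinds of scaling compute, the pure-Python example below standardizes and min-max scales a small made-up list of values. The numbers and the helper names `standardize` and `min_max_scale` are illustrative assumptions and not part of the exercise; `StandardScaler` and `MinMaxScaler` apply the same formulas to each column.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores: (x - mean) / standard deviation."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation, as StandardScaler also uses
    return [(x - mean) / std for x in values]


def min_max_scale(values):
    """Rescale linearly so the smallest value becomes 0 and the largest becomes 1."""
    lowest, highest = min(values), max(values)
    return [(x - lowest) / (highest - lowest) for x in values]


radii = [10.0, 12.0, 14.0, 16.0, 18.0]  # hypothetical measurements

z_scores = standardize(radii)
scaled = min_max_scale(radii)

print(f"mean of z-scores: {fmean(z_scores):.2f}")
print(f"std of z-scores:  {pstdev(z_scores):.2f}")
print(f"min-max scaled:   {scaled}")
```

The z-scores keep the relative spacing of the original values but are centered on 0, while the min-max scaled values are squeezed into the interval from 0 to 1.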

**Tasks**:

1. Which of `X_train, X_test, y_train, y_test` should we use to fit (train) our scaler?
  <details><summary>Hint</summary>Never use test data to train a model</details>
  <details><summary>Solution</summary>X_train_scaled = scaler.fit_transform(X_train)</details>
2. Which of `X_train, X_test, y_train, y_test` should we transform with the scaler, without fitting it again?
  <details><summary>Hint</summary>We should use the test data</details>
  <details><summary>Solution</summary>X_test_scaled = scaler.transform(X_test)</details>

In [None]:
from sklearn.preprocessing import StandardScaler

# Define our scaler
scaler = StandardScaler()

# 2.1. Fit the scaler on the training data
X_train_scaled = scaler.fit_transform(X_train)

# 2.2 Transform the test data
X_test_scaled = scaler.transform(X_test)


---
# Exercise 3 - Training and prediction

We will use the scaled training data to train a model, and then the scaled test data to produce predictions. 

Finally, we will compute a confusion matrix, which shows how many of the predicted values for each outcome were actually correct according to the ground truth.
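
To make the layout of a confusion matrix concrete, here is a small pure-Python sketch that counts (true label, predicted label) pairs by hand. The labels are made up for illustration and `confusion_counts` is a hypothetical helper; sklearn's `confusion_matrix` uses the same layout, with rows for the true labels and columns for the predicted labels, in sorted label order by default.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs and lay them out as a matrix."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = Counter(zip(y_true, y_pred))
    # Row = true label, column = predicted label
    matrix = [[pairs[(true, pred)] for pred in labels] for true in labels]
    return labels, matrix


# Hypothetical labels, made up for illustration
y_true = ["benign", "benign", "malignant", "malignant", "malignant", "benign"]
y_pred = ["benign", "benign", "malignant", "benign", "malignant", "benign"]

labels, matrix = confusion_counts(y_true, y_pred)
print(labels)
for label, row in zip(labels, matrix):
    print(f"actually {label}: {row}")
```

The diagonal holds the correct predictions; every off-diagonal cell is a case where the model predicted a different outcome than the true one.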

**Tasks**:
1. Define a logistic regression model using: `model = <model that we have imported>()`
  <details><summary>Hint</summary>Look at which model we import</details>
  <details><summary>Solution</summary>model = LogisticRegression()</details>
2. Train the model with training data using: `model.fit(<X data>, <y data>)`
  <details><summary>Hint</summary>Never use test data to train a model, and remember to use the scaled X data</details>
  <details><summary>Solution</summary>model.fit(X_train_scaled, y_train)</details>
3. Predict with test data using: `model.predict(<X data>)`
  <details><summary>Hint</summary>Use the scaled test data to predict</details>
  <details><summary>Solution</summary>model.predict(X_test_scaled)</details>
4. Produce a confusion matrix using: `confusion_matrix(<true y data>, <predicted y data>)`
  <details><summary>Hint</summary>Use the ground-truth data and the predictions we obtained</details>
  <details><summary>Solution</summary>cm = confusion_matrix(y_test, y_pred)</details>

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# 3.1. Choose model type
model = LogisticRegression()

# 3.2 Train the model
model.fit(X_train_scaled, y_train)

# 3.3 Make predictions
y_pred = model.predict(X_test_scaled)

# 3.4 Confusion matrix (labels given explicitly so the row/column names below are guaranteed to match)
cm = confusion_matrix(y_test, y_pred, labels=["benign", "malignant"])
cm = pd.DataFrame(cm, columns=["predicted_benign", "predicted_malignant"], index=["actually_benign", "actually_malignant"])
display(cm)

Unnamed: 0,predicted_benign,predicted_malignant
actually_benign,144,0
actually_malignant,9,75


---
# Exercise 4 - Evaluate the model

We can interpret the model using a number of different metrics. In this case we will inspect the confusion matrix and compute accuracy. 
- Note that accuracy can be misleading when the y data is heavily imbalanced (e.g. if almost all cases have the same outcome). We will not look into that now, but it is good to be aware of
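
The following sketch illustrates why accuracy can mislead on imbalanced data. It assumes a made-up test set with 95 benign and 5 malignant cases and a "model" that always predicts the majority class; none of these numbers come from the breast cancer dataset.

```python
# Hypothetical, heavily imbalanced test set: 95 benign cases and 5 malignant ones
y_true = ["benign"] * 95 + ["malignant"] * 5

# A "model" that ignores its input and always predicts the majority class
y_pred = ["benign"] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Share of the truly malignant cases that the model actually caught
malignant_found = sum(t == p == "malignant" for t, p in zip(y_true, y_pred))
malignant_total = y_true.count("malignant")
recall_malignant = malignant_found / malignant_total

print(f"{accuracy=:.2f}")
print(f"{recall_malignant=:.2f}")
```

The accuracy looks high even though the model never finds a single malignant case, which is why it helps to look at the confusion matrix as well.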

**Tasks:**
1. Compute `accuracy` (the share of correct predictions)
  <details><summary>Hint</summary>accuracy = correct predictions / all predictions</details>
  <details><summary>Solution</summary>accuracy = number_of_correct_predictions / number_of_predictions </details>
2. Try to interpret `accuracy` and the confusion matrix (`cm`)
  - How well does the model seem to perform overall?
  - Compare the results for the two outcomes "malignant" and "benign" in the confusion matrix; does the model seem good at predicting both outcomes?


In [None]:
# 4.1. Compute accuracy
number_of_correct_predictions = sum(y_test == y_pred) 
number_of_predictions = len(y_pred)
accuracy = number_of_correct_predictions / number_of_predictions
print(f"{accuracy=:.2f}")

# 4.2. Interpret accuracy and the confusion matrix
display(cm)

"""
<write your interpretation here>



""";

accuracy=0.96


Unnamed: 0,predicted_benign,predicted_malignant
actually_benign,144,0
actually_malignant,9,75


---
# Exercise 5 - Enter your own X values to get a prediction

Once we have a trained model that we are happy with, we can start feeding in new data and making predictions. 

**Task**: 
1. Try entering your own X data and see how it affects the predicted outcome
  - Enter column names and their values (look in the x_variables table below to find the names and reasonable values)
  - Try to work out which variables increase/decrease the probability the most
  <details><summary>Hint</summary>Look in x_variables to see which variables have higher/lower coefficients</details>

Note that in this case we use `model.predict_proba()` to get the probabilities, which the code then converts to percent; `model.predict()` would give us the actual prediction (e.g. "malignant")
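
`predict_proba()` returns one column per class, in the order given by `model.classes_` (sorted labels, so "benign" comes before "malignant" here), which is why the probability of "malignant" sits at index 1. As a rough sketch of what happens inside a binary logistic regression, the example below computes that probability by hand and then derives the label from it. The coefficients, intercept, and feature values are hypothetical and not taken from the trained model, and the two helper functions are illustrative names.

```python
import math


def predict_proba_positive(x, coefficients, intercept):
    """Probability of the positive class for one row of (scaled) features."""
    score = intercept + sum(c * v for c, v in zip(coefficients, x))
    return 1.0 / (1.0 + math.exp(-score))  # logistic (sigmoid) function


def predict_label(probability, classes=("benign", "malignant")):
    """Pick the second class when its probability is above 0.5."""
    return classes[1] if probability > 0.5 else classes[0]


# Hypothetical coefficients and scaled feature values, not taken from the trained model
coefficients = [1.2, -0.4, 0.8]
intercept = -0.3
patient = [0.5, 1.0, -0.2]

probability = predict_proba_positive(patient, coefficients, intercept)
print(f"Probability of malignant: {probability * 100:.1f}%")
print(f"Predicted label: {predict_label(probability)}")
```

A positive coefficient pushes the probability up when the scaled feature value is above average, and a negative coefficient pushes it down.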

In [None]:
# 5.1. Enter custom data: {"variable name": value}
custom_data = [{
    "mean concave points": 0.05,
    "worst area": 892,
    "radius error": 0.4,
    "mean area": 700,
    "worst symmetry": 0.3,
}]

# Update patient_data
patient_data = pd.DataFrame([X_train.mean()])
patient_data.update(custom_data)
patient_data_scaled = scaler.transform(patient_data)

# Predict outcome
prediction = model.predict_proba(patient_data_scaled)[0][1] * 100
print(f"Predicted probability of malignant tumor: {prediction:.1f}%")

In [None]:
# Inspect X variables
x_variables = pd.DataFrame(model.coef_, columns=X_train.columns, index=["Coefficient"]).T
x_variables["Mean value"] = X_train.mean()
x_variables["Min value"] = X_train.min()
x_variables["Max value"] = X_train.max()
x_variables["Standard deviation"] = X_train.std()
display(x_variables)

---
# Exercise 6 - Free play

If you have time left over, you can try working with data/models entirely on your own.

**Suggestions**:
- Find more pandas functions here: [Pandas CheatSheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
- Find more sklearn functions here: [sklearn CheatSheet](https://res.cloudinary.com/dyd911kmh/image/upload/v1676302389/Marketing/Blog/Scikit-Learn_Cheat_Sheet.pdf)
- Compute more evaluation metrics. Does the model perform well according to all of them?
  - `precision = true_positive / (true_positive + false_positive)`
  - `recall = true_positive / (true_positive + false_negative)`
  - `f1 = 2 * (precision * recall) / (precision + recall)`
- Go to e.g. [sklearn.datasets](https://scikit-learn.org/stable/datasets/toy_dataset.html) and find more datasets to analyze
  - Note that the y variable you choose affects which kind of model you can use; categorical y variables require models of the "classification" type, while numeric y variables require models of the "regression" type (not to be confused with logistic regression, which despite its name is a classification model)
- Go to e.g. [sklearn.linear_model](https://scikit-learn.org/stable/modules/linear_model.html) and find more kinds of models to train
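
As a worked example of the precision, recall, and f1 formulas above, the sketch below plugs in the counts from the confusion matrix shown earlier in this notebook. Treating "malignant" as the positive class is an assumption made for this sketch.

```python
# Counts taken from the confusion matrix shown earlier in this notebook,
# treating "malignant" as the positive class (an assumption made for this sketch)
true_positive = 75    # actually malignant, predicted malignant
false_negative = 9    # actually malignant, predicted benign
false_positive = 0    # actually benign, predicted malignant
true_negative = 144   # actually benign, predicted benign

precision = true_positive / (true_positive + false_positive)
recall = true_positive / (true_positive + false_negative)
f1 = 2 * (precision * recall) / (precision + recall)

print(f"{precision=:.2f}")
print(f"{recall=:.2f}")
print(f"{f1=:.2f}")
```

With these counts, precision is 1.00 because there are no false positives, while recall is about 0.89 because 9 of the 84 malignant cases were missed.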

In [None]:
# We can use pandas to create new tables and much more
outcomes = pd.DataFrame()                                            # Creates an empty dataframe
outcomes["count"] = y_train.value_counts()                           # New column consisting of value_counts() for y
outcomes["proportion"] = round(outcomes["count"] / len(y_train), 2)  # New column where we divide count by the number of rows in y_train
outcomes["percent"] = outcomes["proportion"] * 100                   # New column where we convert proportion to percent
outcomes.loc["total"] = outcomes.sum()                               # New row where we sum the values in each column
display(outcomes)                                                    # Shows our new table

# We can print rows, columns, and cells separately
counts = dict(outcomes["count"])      # The count column
totals = dict(outcomes.loc["total"])  # The total row
benign = outcomes["count"]["benign"]  # The cell [count, benign]

print(f"{counts = }")
print(f"{totals = }")
print(f"{benign = }")

In [None]:
# We can also save the dataframe as .csv, Excel, and similar formats
# Note: In the left sidebar of Google Colab there is a tab called "Files", where files are saved to and read from
outcomes.to_csv("outcomes.csv")

# And load it back in from .csv and other formats
outcomes = pd.read_csv("outcomes.csv", index_col=0)
display(outcomes)
