# End-to-end introduction to machine learning using Python

Workshop lead: Abdoulaye Balde [@abdoulayegk](http://twitter.com/abdoulayegk)<br>
Notebook will be  [abdulayegk]()<br>
[Colab Notebook](https://colab.research.google.com/drive/14adHs2rjCAH0TXyvgCBOgEFRwCBzjh9f?usp=sharing)

# Overview
The goal of this workshop is to give learners a general introduction to machine learning and data science with Python, using Pandas and Jupyter.
We will first go through a general overview of Python data structures such as lists, tuples and dictionaries.
Then we go through the process of loading data from CSV files, and inspecting and cleaning the data. As a second step, we will analyse the data and draw some insights from the chronic kidney disease dataset.

The workshop is structured as follows:

- Intro and background
- Part 0: Quick Jupyter exercise
- Part 1: General overview of python
- Part 2: Creation of dataframe and series using pandas
- Part 3: Loading and inspecting data
- Part 4: Data analysis
- Part 5: Model building
- Part 6: Summary

**Note that this workshop is only intended as an introduction to some basic concepts of Python for data science using Pandas. It is by no means intended to be comprehensive, and there are many more useful functions a beginner needs to know to do in-depth data analysis. I hope that this workshop sets you up for self-guided learning to master the full range of necessary Pandas tools.**

## How to follow along with the workshop
- You can run every cell in the notebook as we go along using the shortcut Shift+Enter

# Intro

## What is Jupyter (and the Jupyter ecosystem...)?
- **IPython** is an **interactive Python shell** (just type "ipython" to start it)
- **Jupyter** is a Python library that provides a **web-based UI** on top of ipython to create notebooks with code and output
- **JupyterLab** provides some additional **features on top of Jupyter**, e.g. a file browser
- **Binder** is a **web-based hub** for containers that contain your Python environment and renders notebooks based on a git repo

## Quick overview of Python lists, tuples and dictionaries
- **List**: a mutable (changeable), ordered sequence of elements. Each element or value inside a list is called an item. Lists are written with values between square brackets `[ ]`.
- **Tuple**: an immutable (unchangeable), ordered sequence of elements. Because tuples are immutable, their values cannot be modified. Tuples are written with values between parentheses `( )`, separated by commas.
- **Dictionary**: Python's built-in mapping type. Dictionaries map keys to values, and these key-value pairs provide a useful way to store data. They are typically used to hold related data, such as the information contained in an ID or a user profile, and are written with curly braces `{ }`.

## What is Pandas/Matplotlib/Pyplot/Seaborn?

- **Pandas** is a Python library for **data manipulation and analysis**. It offers data structures and operations for manipulating numerical tables and time series.
- **Matplotlib** is a Python **2D plotting library**. Pyplot is a collection of command style functions in matplotlib that make matplotlib work like MATLAB. While we mostly use Seaborn, we sometimes fall back to using Pyplot functions for certain aspects of plotting.
- **Seaborn** is a Python **data visualization** library based on matplotlib. It's kind of like a nicer version of Pyplot.
- You can **use Pandas code in a regular Python script** of course. I'm just combining Jupyter + Pandas in this tutorial because notebooks are a great way to immediately see output!

### Notebooks are basically just interactive ipython terminals, often mixed in with markdown text:
- Each input field you see is called a **cell**
- Cells can be **either code or markdown**
- You can execute any kind of Python code
- **Variables persist** between cells
- The notebook **doesn't care about the order of cells**, only about the order in which you execute them: variables are remembered in execution order. However, "run all" executes your cells top to bottom.

### Notebooks have **two modes**: a) editing the cells and b) navigating the notebook (command mode):
- You can **navigate** around the notebook in command mode by clicking cells or using the arrow keys
- Depending on the environment you're using (Jupyter notebook, Jupyter lab, Google Colab...) there will be a different **visual cue** (e.g. a colored line) to indicate the mode a cell is in
- In order to **edit a cell**, you can press **Enter** or double-click it.
- To **execute** the cell content, press Shift+Enter to run the cell
- To get **out of edit mode** and back into navigation mode, press the **Escape key**

### Some helpful keyboard shortcuts:
- The **default type for a cell is code**. In command mode, press *m* to make a cell markdown and *y* to make it code
- Press *a* in command mode to create a new cell *above* the current one
- Press *b* in command mode to create a new cell *below* the current one
- *Tab* autocompletes methods (like in IPython)
- *Shift+Tab* shows you the docstring for the outer function of the line your cursor is in
- Press *dd* in command mode to delete a cell. 
- *Cmd+z* undoes operations in the highlighted cell, *z* undoes cell operations in the notebook (e.g. deleting a cell)

# Part 1: General overview of Python
In this part we go through the basic things we need to know before loading data, starting with a first print statement and moving on to lists, tuples and dictionaries.<br>
**Note: this covers only the basics needed to follow along in this workshop. If you want to go deeper, pick up a dedicated Python book.**

In [None]:
# Print a greeting in Python
print("Hello world!")

## List
### The elements of a list can be of any type.

In [None]:
# An empty list
a = []
print(a)

print("\n\n")  # print some blank lines as a separator
b = [1, 2, 3, 4]  # list of numbers
print(b)
print("\n\n")
fruits = ["Orange", "Banana", "Apple"]  # list of fruits
print(fruits)

print("\n\n")
# a list can mix different data types, including other lists
mylist = [1, "Pineapple", 3.14, [2, 3, 4]]
print(mylist)

## Tuple
### A tuple is like a list, except that its elements cannot be changed (immutable): `t = ('a', 'b'); t[0] = 'c'` raises an error.

In [None]:
# Tuple
t = ()  # an empty tuple
print(t)


mytuple1 = (1, 2, 3, 4, 5)
print(mytuple1)

In [None]:
bar = [(2, 3, 4), (), ("Banana", "orange")]
type(bar)
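
The following is a minimal sketch of what immutability means in practice: item assignment works on a list but raises a `TypeError` on a tuple. The variable names are illustrative only and are not used elsewhere in the workshop.

```python
# Lists are mutable: item assignment works
letters_list = ["a", "b"]
letters_list[0] = "c"

# Tuples are immutable: the same assignment raises a TypeError
letters_tuple = ("a", "b")
try:
    letters_tuple[0] = "c"
    tuple_changed = True
except TypeError as err:
    tuple_changed = False
    print("Cannot modify a tuple:", err)

print(letters_list)   # ['c', 'b']
print(letters_tuple)  # ('a', 'b')
```

If you need a modified version of a tuple, build a new one (for example `("c",) + letters_tuple[1:]`) instead of changing it in place.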

## Dictionary
### A collection of key-value pairs.

In [None]:
traduction = {"chien": "Dog", "Chat": "Cat", "Guinee": "Guinea"}
traduction

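**NB**: keys must be unique. Since the dictionary already has the key `"chien"`, adding `"chien"` again does not create a second entry; it overwrites the existing value. Keys are case-sensitive, so `"Chien"` would count as a different key.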
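
The note on unique keys can be checked with a small sketch. It uses a shortened copy of the dictionary, and the replacement values are made up for illustration.

```python
# Keys are case-sensitive, so "chien" and "Chien" are two different keys
traduction = {"chien": "Dog", "Chat": "Cat"}
traduction["Chien"] = "Dog (capitalized key)"
print(len(traduction))  # 3

# Assigning to an existing key overwrites its value instead of adding an entry
traduction["chien"] = "Hound"
print(len(traduction))      # still 3
print(traduction["chien"])  # Hound
```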

In [None]:
dic = {}  # an empty dictionary

mydic = {"username": "abdoulayegk", "online": True, "followers": 987}
print(mydic)

In [None]:
mydic["username"]

In [None]:
print(mydic["followers"])
# Returns 987

print(mydic["online"])
# Returns True

In [None]:
# to print the key-value pairs
for key, value in mydic.items():
    print(key, "is the key for the value", value)

**Using methods to access elements** <br>
In addition to using keys to access values, we can also work with some built-in methods:<br>

`dict.keys()` returns the keys <br>
`dict.values()` returns the values<br>
`dict.items()` returns the items as (key, value) tuple pairs<br>

In [None]:
# To return all the keys of our dictionary
print(mydic.keys())

In [None]:
# To return the values of a dictionary
print(mydic.values())

# Part 2: creation of Series and DataFrame in Pandas

## What is a dataframe?
* A **dataframe** is a **2-dimensional labeled data structure** with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used Pandas object. 
* Pandas borrows the concept of DataFrame from the statistical programming language R.
* There are a lot of **different ways to read data** into a dataframe - from lists, dicts, CSVs, databases... In this example, we're loading data from a CSV file!
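
To make the "dict of Series" picture concrete, here is a small sketch. The patient labels and values are made up for illustration; only the column abbreviations (`age`, `appet`) mirror the dataset used later.

```python
import pandas as pd

# Two Series sharing the same index labels
ages = pd.Series([48, 62, 51], index=["p1", "p2", "p3"])
appetite = pd.Series(["good", "poor", "good"], index=["p1", "p2", "p3"])

# A dict of Series becomes a DataFrame: the keys turn into column names
patients = pd.DataFrame({"age": ages, "appet": appetite})
print(patients)
print(patients.dtypes)  # each column keeps its own type
```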

**Before loading the chronic kidney disease dataset, let's build a Series and a DataFrame by hand to familiarize ourselves with the format and data types.**

# Import the necessary libraries
Before working with libraries such as Pandas or NumPy, we have to import them, and before that they have to be installed. If that is not yet done on your machine, here are the [instructions](https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html) for installing them. Once that is done, we can import them.

In [None]:
# we first import Series from pandas
from pandas import Series

- **Note we can create a series from a list**

In [None]:
# let's first create a Series
mylist = [20, 20, 30, 40]

s = Series(mylist)
s

In [None]:
s[2]

In [None]:
# you can also set your own index; the default index runs from 0 to n-1
ss = Series(
    [12, 13, 7, 80, 95], index=list("abcde")
)  # note: we could also write index=['a','b',..]
ss


# # you can also change the datatype to float or unsigned int
# Series(
#     [20.5, 12, 34, 56, 100],
#     index=["a", "b", "c", "d", "e"],
# )

In [None]:
# We can set the data type; by default dtype is None, so pandas infers it
s1 = Series([12, 13, 7, 80, 95], index=list("abcde"), dtype="int8")
s1

In [None]:
# we can use integer index labels of our choice; here we use the range function
ex1 = Series([5, 3, 7, 8, 19], index=range(10, 15))
ex1

## Slicing series

In [None]:
# using the example above, to get the value at index label 10 we can do the following
ex1[10]  # output should be 5

In [None]:
# with a string-based index we do the same, using quotes inside the brackets
# if we want to get the value at index e we can do
print(ss)
print("Value at index e is {}".format(ss["e"]))

In [None]:
# you can also add element to your series. Let's say I want to add 200 in ss
ss["f"] = 200
ss

In [None]:
# you can use comparison operators on a Series; the result is a boolean Series
ss > 12


# # Use the boolean result as a mask to keep only the matching values
# ss[ss < 13]

In [None]:
# What is the difference between the following two lines (list vs Series)?

l = [12, 13, 7, 80, 95] * 2
print(l)
print("\n\n")

print(ss * 2)

### Dataframe

In [None]:
from pandas import DataFrame

In [None]:
data = {
    "capital": [
        "Delhi",
        "Delhi",
        "Delhi",
        "Delhi",
        "Delhi",
        "Conakry",
        "Conakry",
        "Conakry",
        "Conakry",
        "Conakry",
        "washington",
        "washington",
        "washington",
        "washington",
        "washington",
    ],
    "year": [
        2001,
        2004,
        2007,
        2010,
        2015,
        2001,
        2005,
        2008,
        2011,
        2019,
        2001,
        2003,
        2007,
        2009,
        2017,
    ],
    "pop": [
        2.45,
        2.99,
        3.01,
        3.50,
        4.24,
        2.47,
        2.73,
        2.85,
        2.99,
        3.11,
        2.11,
        3.00,
        3.67,
        3.73,
        3.97,
    ],
}

df = DataFrame(data)
df

In [None]:
# we can select the rows whose capital is not Conakry (Guinea)
df[df["capital"] != "Conakry"]

In [None]:
# create a new dataframe where year is greater than 2010

df[df["year"] > 2010]

In [None]:
# we can also select columns: for example, to keep only capital and year we can do the following

df[
    ["capital", "year"]
].head()  # the head function will return the first 5 rows of your dataframe

In [None]:
df.to_csv("workshop.csv", index=False)

# Part 3: Loading and inspecting a CSV file

Before we can start answering questions about the data we need to do a little bit of exploratory analysis. The first thing we need to do when working with a new dataset is to get an idea of what the data looks like. We start by loading the data into memory. Pandas comes with a built-in `read_csv` function that we can use to read CSV files and load them directly into a pandas `DataFrame` object.
- **Note: the dataset is on my GitHub account** [dataset](https://raw.githubusercontent.com/abdoulayegk/ml-workshop/main/kidney_disease.csv)
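
As a rough sketch of how `read_csv` behaves, the example below reads a tiny made-up CSV from a string instead of a file. The rows are invented for illustration, and passing `na_values=["?"]` is one possible way (an assumption, not the approach taken in this notebook) to treat placeholder strings as missing at load time.

```python
import io

import pandas as pd

# A tiny made-up CSV standing in for a real file on disk
csv_text = """id,age,rc,classification
0,48,5.2,ckd
1,7,?,ckd
2,62,,notckd
"""

# read_csv accepts a path, a URL, or a file-like object;
# na_values lists extra strings that should be read as missing
toy = pd.read_csv(io.StringIO(csv_text), na_values=["?"])
print(toy)
print(toy.dtypes)
```

Because the `?` is read as missing, the `rc` column comes out numeric rather than as text.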

In [None]:
# We need to import the libraries to start with
import warnings

import matplotlib.pyplot as plt
import missingno as ms
import numpy as np
import pandas as pd
import seaborn as sns

warnings.filterwarnings("ignore")
# This command makes charts show inline in a notebook
%matplotlib inline
plt.style.use("ggplot")

# Making the figures show up a little larger than default size
plt.rcParams["figure.figsize"] = [10, 6]

### Loading real world dataset

In [None]:
# Read data from a CSV into a dataframe
# This is the data we're going to be working with!
df = pd.read_csv("kidney_disease.csv")

In [None]:
# Just typing the name of the dataframe will print the entire output
# If there are too many rows, Jupyter will print the top few and
# bottom few rows with a "..." to indicate that there are more rows
df

# Data Set Information:

The dataset uses the following column abbreviations:
1. age - age
2. bp - blood pressure
3. sg - specific gravity
4. al - albumin
5. su - sugar
6. rbc - red blood cells
7. pc - pus cell
8. pcc - pus cell clumps
9. ba - bacteria
10. bgr - blood glucose random
11. bu - blood urea
12. sc - serum creatinine
13. sod - sodium
14. pot - potassium
15. hemo - hemoglobin
16. pcv - packed cell volume
17. wc - white blood cell count
18. rc - red blood cell count
19. htn - hypertension
20. dm - diabetes mellitus
21. cad - coronary artery disease
22. appet - appetite
23. pe - pedal edema
24. ane - anemia
25. class - class
26. id

## Inspecting a dataframe using built-in functions
* Most operations on a dataframe happen by applying a function to it using the "." notation, e.g. `my_dataframe.do_something()`
* Let's look at some simple functions that we can apply to Pandas dataframes

**Note**: It is very important to give your columns meaningful names.

In [None]:
# let's see the columns name of our dataset
df.columns.to_list()

In [None]:
# The head(n) function shows the first n rows of a dataframe.
# If no n is specified, it defaults to 5 rows.
df.head()

In [None]:
# You can also use the sample() function to get n random rows in
# the dataframe
df.sample(5)

In [None]:
# This method prints information about a DataFrame including the index dtype and columns, non-null values and memory usage
# Let's talk about the # column later!
df.info()

In [None]:
# the variable classification is our target so let's rename it.
df = df.rename(columns={"classification": "target"})

In [None]:
# The describe function shows some basic statistics for numeric columns
# .T transposes the result so that each column becomes one row
df.describe().T

In [None]:
# now let's see the shape of our dataset: (number of rows, number of columns)
df.shape

## Other ways to inspect a dataframe
* There are other operations you can do on a dataframe that don't follow the function notation
* Let's look at a few examples:
1. len(df)
2. df.dtypes, etc
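
A short sketch of these non-method ways of inspecting a dataframe, using a made-up three-row frame. Note that `shape`, `dtypes` and `columns` are attributes, so they take no parentheses, while `len` is a Python built-in.

```python
import pandas as pd

toy = pd.DataFrame(
    {"age": [48.0, 7.0, 62.0], "pc": ["normal", "normal", "abnormal"]}
)

print(len(toy))     # number of rows
print(toy.shape)    # (rows, columns)
print(toy.dtypes)   # dtype of each column
print(toy.columns)  # column labels
```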


## <span style="color:blue">*** DIY exercise ***</span>
Create a new cell below and print the first ten rows of the "df" dataframe.

# Part 4: Data Exploration
Now that the chronic kidney disease data is loaded, let's explore it to gain some analytical insight into which patients have the disease.

## Accessing columns in a dataframe

In [None]:
# let's plot a graph to see missing values in our dataset
ms.matrix(df)

In [None]:
# Check the type to show that this indeed returns a Series object
type(df["wc"])

In [None]:
# And this is how you access two columns of a dataframe.
# Note that this will return a dataframe again, not a series
# (because a series has only one column...)
# Also note the double square brackets
# because you're passing a *list* of columns as an argument
df[["wc", "pcc"]].head()

## <span style="color:blue">*** DIY exercise ***</span>
Create a new cell below and print the list of unique values in the `pc` column of the dataframe.

## Accessing rows in a dataframe
In addition to slicing by column, we often want to get the records where a column has a specific value, e.g. a specific age here. This can be done using the `.loc` indexer and a boolean condition:

### Loc

In [None]:
# Access the record(s) where the value in the age column is 20
df.loc[df["age"] == 20]

In [None]:
# You can also use boolean conditions in the selector
df.loc[(df["age"] == 20) & (df["pc"] == "normal")]

### iloc

In [None]:
df.iloc[:, [2, 3]]

In [None]:
df.iloc[[0, 2], [1, 3]]

In [None]:
df.iloc[:, lambda df: [0, 2]]

**loc** selects by label.<br>
**iloc** selects by integer position.
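
The difference is easiest to see when the index labels are not 0, 1, 2. The sketch below uses a made-up frame with labels 10, 20 and 30.

```python
import pandas as pd

# Index labels deliberately differ from positions 0, 1, 2
toy = pd.DataFrame({"age": [48, 7, 62]}, index=[10, 20, 30])

by_label = toy.loc[10, "age"]      # row labelled 10
by_position = toy.iloc[0, 0]       # first row, first column
last_by_position = toy.iloc[2, 0]  # third row

# Slices differ too: loc includes the end label, iloc excludes the end position
print(toy.loc[10:20])  # rows labelled 10 and 20
print(toy.iloc[0:1])   # only the first row
```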

## Sorting dataframes
Sorting the output of a dataframe can be helpful for visually inspecting or presenting data! Sorting by one or multiple columns is super easy using the `sort_values` function:

In [None]:
# Sort by age, youngest first, i.e. in ascending order (default)
df.sort_values("age").head()

- **Note: you can pass `ascending=False` to sort in descending order. You can also sort a whole DataFrame by several columns.**
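
A minimal sketch of both options on a made-up frame: descending order on one column, and sorting by two columns with a different direction for each.

```python
import pandas as pd

toy = pd.DataFrame({"age": [48, 7, 62, 48], "bp": [80, 50, 70, 90]})

# Oldest first
oldest_first = toy.sort_values("age", ascending=False)

# Sort by age ascending, then break ties by bp descending
multi = toy.sort_values(["age", "bp"], ascending=[True, False])

print(oldest_first)
print(multi)
```

Both calls return a new dataframe; `toy` keeps its original row order unless you pass `inplace=True` or assign the result back.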


In [None]:
# Use the inplace keyword to modify the dataframe
# Note that you can also sort by a list of columns
df.sort_values(["id", "age"], inplace=True)

In [None]:
# we can use replace; here the dictionary maps old values to new ones
df["target"] = df["target"].replace({"ckd": 1, "notckd": 0})

Without `inplace=True` or an assignment, `replace` returns a modified copy and the change is only temporary. Assigning the result back to the column, as above, makes the change in the original dataframe.

In [None]:
sns.catplot(x="target", kind="count", data=df)

In [None]:
df.rc.unique()

You can see from the cell above that the values are numbers stored as strings, which is why the dtype is object. We have to convert them to the appropriate type.

In [None]:
df.pcv.unique()

In [None]:
df.wc.unique()

In [None]:
df.sg.unique()

Our rc column has dtype object. It also has some missing values and some placeholder values we can't interpret, such as `?`. We will replace these placeholders with NaN here and fill the missing values later.

In [None]:
# Replace the "?" placeholder strings with NaN
df["rc"] = df["rc"].replace("?", np.nan)
df["wc"] = df["wc"].replace("?", np.nan)
df["pcv"] = df["pcv"].replace("?", np.nan)

In [None]:
df.rbc.unique()

If you notice, age is of float type. It would be good to convert it to int, but for now let's leave it as it is.

In [None]:
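# distplot is deprecated in recent seaborn versions; histplot with kde=True is the closest replacement
sns.histplot(df.age, kde=True)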

#### These columns are still of type object, so we have to convert them to numeric:
1. pcv        
2. wc         
3. rc         

In [None]:
# now we have to change the datatype of pcv, wc and rc
df.wc = df.wc.astype("float64")
df.rc = df.rc.astype("float64")
df.pcv = df.pcv.astype("float64")
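
`astype("float64")` raises an error if any value cannot be parsed as a number. As a more forgiving alternative (a sketch on made-up values, not what this notebook does), `pd.to_numeric` with `errors="coerce"` turns anything unparseable into NaN in one step.

```python
import pandas as pd

# Made-up values imitating a numeric column that was read as text
raw = pd.Series(["5.2", "?", "4.8", None, "abc"])

# errors="coerce" converts anything unparseable into NaN instead of raising
clean = pd.to_numeric(raw, errors="coerce")
print(clean)
print(clean.isna().sum())  # number of values that could not be converted
```

The trade-off is that coercion hides unexpected values, so it is worth checking `isna().sum()` before and after the conversion.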

In [None]:
# # Now let's fill missing values
# df.age.fillna(df.age.mean(), inplace=True)
# df.bp.fillna(df.bp.mean(), inplace=True)
# df.sg.fillna(df.sg.mean(), inplace=True)
# df.al.fillna(df.al.mean(), inplace=True)
# df.su.fillna(df.su.mode(), inplace=True)
# df.wc.fillna(df.wc.mean(), inplace=True)
# df.pcv.fillna(df.pcv.mean(), inplace=True)
# df.rc.fillna(df.rc.mean(), inplace=True)
# df.age.fillna(df.age.mean(), inplace=True)
# df.al.fillna(df.al.mode(), inplace=True)
# df.su.fillna(df.su.mean(), inplace=True)
# df.pot.fillna(df.pot.mean(), inplace=True)
# df.bu.fillna(df.bu.mean(), inplace=True)
# df.sod.fillna(df.sod.mean(), inplace=True)
# df.hemo.fillna(df.hemo.mean(), inplace=True)
# df.sc.fillna(df.sc.mean(), inplace=True)
# df.bgr.fillna(df.bgr.mean(), inplace=True)

**NB** There are several ways to replace missing values:
1. fillna(0)
2. ffill()
3. bfill(), etc
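
The sketch below compares these strategies on a small made-up Series. One detail worth noting: `mode()` returns a Series (there can be several modes), so take its first element before passing it to `fillna`.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan])

filled_zero = s.fillna(0)         # constant value
filled_mean = s.fillna(s.mean())  # mean of the non-missing values (2.0 here)
filled_fwd = s.ffill()            # carry the previous value forward
filled_bwd = s.bfill()            # pull the next value backward

# mode() returns a Series, so take its first element before filling
filled_mode = s.fillna(s.mode()[0])

print(filled_fwd.tolist())
print(filled_bwd.tolist())  # the last value stays NaN: nothing follows it
```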

In [None]:
df.isna().sum()

In [None]:
df.rbc.unique()
df.rbc = df.rbc.map({"normal": 1, "abnormal": 0})
df.pc = df.pc.map({"normal": 1, "abnormal": 0})
df.pcc = df.pcc.map({"present": 1, "notpresent": 0})
df.ba = df.ba.map({"present": 1, "notpresent": 0})
df.htn = df.htn.map({"yes": 1, "no": 0})
df.dm = df.dm.map({"yes": 1, "no": 0})
df.cad = df.cad.map({"yes": 1, "no": 0})
df.appet = df.appet.map({"good": 1, "poor": 0})
df.pe = df.pe.map({"yes": 1, "no": 0})
df.ane = df.ane.map({"yes": 1, "no": 0})

**NB** Sklearn provides a nice way of encoding features:
- OneHotEncoder [Reference](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)


**Pandas also provides a handy dummy-encoding function**
- pd.get_dummies() [Reference](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html)
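
As a quick illustration of dummy encoding, here is a sketch on a made-up frame. It is not applied to the workshop dataframe, where the categorical columns are mapped to 0/1 by hand.

```python
import pandas as pd

toy = pd.DataFrame({"appet": ["good", "poor", "good"], "age": [48, 7, 62]})

# One new indicator column per category; other columns pass through untouched
encoded = pd.get_dummies(toy, columns=["appet"])
print(encoded)
```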

In [None]:
df.htn.unique()

In [None]:
df.head()

In [None]:
df.isna().sum()

In [None]:
df.age.plot(kind="hist", bins=20)
plt.show()

In [None]:
print(df.target.value_counts(normalize=True) * 100)

In [None]:
# Explore appetite  vs target

plt.figure(figsize=(16, 6))
sns.countplot(x="appet", hue="target", data=df)
plt.xticks(fontweight="light", fontsize="x-large");

In [None]:
# Explore pc(pus cell)  vs target

plt.figure(figsize=(16, 6))
sns.countplot(x="pc", hue="target", data=df)
plt.xticks(fontweight="light", fontsize="x-large")
plt.show()

### Boxplot: boxplot is a method for graphically depicting groups of numerical data through their quartiles.

In [None]:
sns.boxplot(x=df.age)

In [None]:
# from the cell above we can see that we have outliers (extreme values). A boxplot is a good way to visualize outliers in our data
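
To see where the box and the outlier points of a boxplot come from, the quartiles can be computed by hand. This is a sketch on made-up ages, using the common convention that points more than 1.5 times the interquartile range (IQR) beyond the quartiles are drawn as outliers.

```python
import pandas as pd

ages = pd.Series([2, 45, 48, 50, 52, 55, 60, 62, 65, 90])

q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Whiskers extend at most 1.5 * IQR beyond the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]

print(q1, q3, iqr)        # 48.5 61.5 13.0
print(outliers.tolist())  # [2, 90]
```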

In [None]:
# numeric_only=True skips any column that is still non-numeric
sns.heatmap(df.corr(numeric_only=True), annot=True)

### What is the total number of people whose age is greater than 20 and who are suffering from the disease?

In [None]:
len(df[(df["age"] > 20) & (df["target"] == 1)])

We can see that 233 people are older than 20 and are also suffering from the disease.

### What is the number of people whose appetite is good but who are suffering from the disease?

In [None]:
len(df[(df["appet"] == 1) & (df["target"] == 1)])

There appear to be 168 people with a good appetite who are nevertheless suffering from the disease.

## Query
Query the columns of a DataFrame with a boolean expression.

inplace: bool
Whether the query should modify the data in place or return a modified copy.


In [None]:
df.query("age > 70")

**Note**: by default this method returns a brand new dataframe; pass **inplace=True** to modify the original dataframe instead.
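
A small sketch of `query` on a made-up frame, showing that the default call leaves the original untouched and that conditions can be combined with `and` / `or`.

```python
import pandas as pd

toy = pd.DataFrame({"age": [48, 75, 62, 81], "bp": [80, 90, 70, 100]})

# query returns a new dataframe; toy itself is unchanged
over_70 = toy.query("age > 70")

# Several conditions can be combined inside the expression
over_70_high_bp = toy.query("age > 70 and bp >= 95")

print(over_70)
print(over_70_high_bp)
print(len(toy))  # still 4
```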

# Part 5: Model building

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

![](ml.png)

In [None]:
df.head()

### Using  pandas get_dummies
I don't always recommend using this; you can find the reasons here: [link](https://stackoverflow.com/questions/36631163/what-are-the-pros-and-cons-between-get-dummies-pandas-and-onehotencoder-sciki#56567037)

In [None]:
# Encoding the remaining features with pandas get_dummies
# df = pd.get_dummies(df)

In [None]:
# we are going to select the features and the target variable.
"""
Note that X is an n-dimensional array or DataFrame
y: 1D array or Series
"""
# use the columns keyword: passing the axis positionally no longer works in pandas 2.x
X = df.drop(columns=["id", "target"])
y = df.target

![](img2.png)

In [None]:
X

## Handling missing values
We are going to use SimpleImputer from sklearn to fill in the missing values, and then we will scale the data.<br>
**StandardScaler standardizes features by removing the mean and scaling to unit variance.**

The standard score of a sample x is calculated as:

z = (x - u) / s

where u is the mean of the training samples and s is their standard deviation.
1. SimpleImputer: https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
2. StandardScaler: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
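
The formula can be checked against `StandardScaler` directly. This is a sketch on a made-up single-column array; it assumes the population standard deviation (NumPy's default), which is what `StandardScaler` uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[50.0], [60.0], [70.0], [80.0]])

# Manual z-score: u is the column mean, s the (population) standard deviation
u = x.mean(axis=0)
s = x.std(axis=0)
z_manual = (x - u) / s

# StandardScaler learns u and s in fit() and applies the same formula in transform()
z_sklearn = StandardScaler().fit_transform(x)

print(np.allclose(z_manual, z_sklearn))  # True
```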

In [None]:
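# encoding the target with label encoder
encoder = LabelEncoder()
df.target = encoder.fit_transform(df.target)
# y was taken from df before this step, so re-assign it to pick up the encoded labels
y = df.target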

In [None]:
# scaling the data
pipeline = Pipeline(
    [("impute", SimpleImputer(strategy="mean")), ("scale", StandardScaler())]
)

X = pd.DataFrame(columns=X.columns, data=pipeline.fit_transform(X))

In [None]:
# X

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

![:scale 80%](split.png)

In [None]:
lgr = LogisticRegression()
lgr.fit(X_train, y_train)
y_pred = lgr.predict(X_test)
y_pred[:10]

![](ml1.png)

In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
accuracy_score(y_test, y_pred)

In [None]:
confusion_matrix(y_test, y_pred)
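
To help read the confusion matrix, here is a sketch with made-up labels that unpacks its four cells and recomputes accuracy, precision and recall by hand. The names `y_true` and `y_hat` are illustrative, chosen so they do not overwrite `y_test` and `y_pred`.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = disease, 0 = no disease
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_hat = [1, 1, 0, 0, 0, 1, 1, 0]

# For binary labels the matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the predicted positives, how many are right
recall = tp / (tp + fn)     # of the actual positives, how many were found

print(tn, fp, fn, tp)               # 3 1 1 3
print(accuracy, precision, recall)  # 0.75 0.75 0.75
```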

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
cross_val_score(lgr, X, y, scoring="accuracy").mean()

In [None]:
# define lists to collect scores
train_scores, test_scores = list(), list()
# values of n_neighbors to try if you switch to the KNN line below
values = [i for i in range(1, 51)]
# fit and evaluate a model for each value
for i in values:
    # configure the model
    # (LogisticRegression ignores i, so every iteration gives the same scores;
    # to see them vary, import KNeighborsClassifier from sklearn.neighbors
    # and use the commented line instead)
    # model = KNeighborsClassifier(n_neighbors=i)
    model = LogisticRegression()
    # fit model on the training dataset
    model.fit(X_train, y_train)
    # evaluate on the train dataset
    train_yhat = model.predict(X_train)
    train_acc = accuracy_score(y_train, train_yhat)
    train_scores.append(train_acc)
    # evaluate on the test dataset
    test_yhat = model.predict(X_test)
    test_acc = accuracy_score(y_test, test_yhat)
    test_scores.append(test_acc)
    # summarize progress
    print(">%d, train: %.3f, test: %.3f" % (i, train_acc, test_acc))
# plot of train and test scores vs the value of i
plt.plot(values, train_scores, "-o", label="Train")
plt.plot(values, test_scores, "-o", label="Test")
plt.legend()
plt.show()

### We can take another example with KNN if we want.

**In many cases you will want to try different machine learning models and different preprocessing techniques. I highly encourage you to use cross-validation to detect and avoid overfitting.**
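
One way to follow this advice is sketched below: put the preprocessing and the model in a single `Pipeline` and pass it to `cross_val_score`, so the imputer and scaler are re-fit on each training fold. The data here is synthetic (`make_classification` with a few values knocked out), and the two candidate models and their settings are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with some values removed to mimic missing entries
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)
rng = np.random.default_rng(42)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "knn (k=5)": KNeighborsClassifier(n_neighbors=5),
}

cv_scores = {}
for name, model in candidates.items():
    # Preprocessing lives inside the pipeline, so it is re-fit on each training fold
    pipe = Pipeline(
        [
            ("impute", SimpleImputer(strategy="mean")),
            ("scale", StandardScaler()),
            ("model", model),
        ]
    )
    scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[name] = scores.mean()

for name, score in cv_scores.items():
    print(f"{name}: {score:.3f}")
```

Keeping the preprocessing inside the pipeline means the validation fold never influences the imputation means or the scaling statistics, which gives a more honest estimate than transforming the whole dataset up front.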

# Part 6: Summary!

We hope this workshop was useful for you. We've only touched on some of the **basic concepts** of Pandas, but we believe this will give you the foundations to keep exploring the data! We covered:

- Basic operations in Jupyter notebooks
- Dataframes and Series in Pandas, and loading data to a dataframe
- Basic data inspection (head, describe, dtypes, accessing columns and rows, sorting)
- count, nunique
- Indexing in dataframes and reset_index
- Plotting (bar plots, hist plots, boxplot, heatmap)
- Model building and evaluation

**I would appreciate your feedback** <br>
Email: abdoulayegnbalde@gmail.com <br>
github: abdoulayegk <br>
twitter: @abdoulayegk