# Intro to Data Science and Analysis with Pandas & Jupyter, September 2021

Workshop lead: Abdoulaye Balde [@abdoulayegk](http://twitter.com/abdoulayegk)<br>
Notebook will be  [abdulayegk]()

# Overview
The goal of this workshop is to give learners an intro to data science and analysis with Python using Pandas and Jupyter. 
We will first go through a general overview of Python, covering lists, tuples and dictionaries.
Then we will go through the process of loading data from CSV files, and inspecting and cleaning the data. As a second step, we will analyse the data and draw some insights about the chronic kidney disease dataset. 

The workshop is structured as follows:

- Intro and background
- Part 0: Quick Jupyter exercise
- Part 1: General overview of python
- Part 2: Loading and inspecting data
- Part 3: Data analysis
- Part 4: Summary

**Note that this workshop is only intended as an introduction to some basic concepts of Python for data science using Pandas. It is by no means intended to be comprehensive, and there are many more useful functions a beginner needs to know to do in-depth data analysis. I hope that this workshop sets you up for self-guided learning to master the full range of necessary Pandas tools.**

## How to follow along with the workshop
- You can run every cell in the notebook as we go along using the shortcut Shift+Enter
- You will encounter a few <span style="color:blue">*** DIY exercise ***</span> blocks where you'll get a few minutes to try out what you've just learned
- Feel free to save and download your notebook from Binder at the end since Binder deletes notebooks after 12 hours.

# Intro

## What is Jupyter (and the Jupyter ecosystem...)?
- **IPython** is an **interactive Python shell** (just type "ipython" to start it)
- **Jupyter** is a Python library that provides a **web-based UI** on top of ipython to create notebooks with code and output
- **JupyterLab** provides some additional **features on top of Jupyter**, e.g. a file browser
- **Binder** is a **web-based hub** for containers that contain your Python environment and renders notebooks based on a git repo

## Quick overview of Python list, tuple and dictionary
- **List**: A list is a data structure in Python that is a mutable, or changeable, ordered sequence of elements. Each element or value inside a list is called an item. Lists are defined by having values between square brackets [ ].
- **Tuple**: A tuple is a data structure that is an immutable, or unchangeable, ordered sequence of elements. Because tuples are immutable, their values cannot be modified. Tuples have values between parentheses ( ) separated by commas.
- **Dictionary**: The dictionary is Python’s built-in mapping type. Dictionaries map keys to values and these key-value pairs provide a useful way to store data in Python.

    Typically used to hold related data, such as the information contained in an ID or a user profile, dictionaries are constructed with curly braces on either side { }.
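
The difference between mutable and immutable types is easiest to see by trying to change them. The following is a minimal sketch; the variable names and values are made up for illustration.

```python
# Lists are mutable: items can be changed, added or removed
fruits = ["Orange", "Banana", "Apple"]
fruits[0] = "Mango"          # replace the first item
fruits.append("Pineapple")   # add an item at the end

# Tuples are immutable: trying to change an item raises a TypeError
point = (3, 4)
try:
    point[0] = 10
except TypeError as error:
    tuple_error = str(error)

# Dictionaries map keys to values and can be extended with new keys
profile = {"username": "abdoulayegk", "online": True}
profile["followers"] = 987

print(fruits)
print(tuple_error)
print(profile)
```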

## What is Pandas/Matplotlib/Pyplot/Seaborn?

- **Pandas** is a Python library for **data manipulation and analysis**. It offers data structures and operations for manipulating numerical tables and time series.
- **Matplotlib** is a Python **2D plotting library**. Pyplot is a collection of command style functions in matplotlib that make matplotlib work like MATLAB. While we mostly use Seaborn, we sometimes fall back to using Pyplot functions for certain aspects of plotting.
- **Seaborn** is a Python **data visualization** library based on matplotlib. It's kind of like a nicer version of Pyplot.
- You can **use Pandas code in a regular Python script** of course. I'm just combining Jupyter + Pandas in this tutorial because notebooks are a great way to immediately see output!

### Notebooks are basically just interactive ipython terminals, often mixed in with markdown text:
- Each input field you see is called a **cell**
- Cells can be **either code or markdown**
- You can execute any kind of Python code
- **Variables persist** between cells
- The notebook **doesn't care about the order of cells**, only about the order in which you execute them, because that determines the value each variable holds. However, "run all" executes your cells top to bottom.

### Notebooks have **two modes**: a) editing the cells and b) navigating the notebook (command mode):
- You can **navigate** around the notebook in command mode by clicking cells or using the arrow keys
- Depending on the environment you're using (Jupyter notebook, Jupyter lab, Google Colab...) there will be a different **visual cue** (e.g. a colored line) to indicate the mode a cell is in
- In order to **edit a cell**, you can press **Enter** or double-click it.
- To **execute** the cell content, press Shift+Enter to run the cell
- To get **out of edit mode** and back into navigation mode, press the **Escape key**

### Some helpful keyboard shortcuts:
- The **default type for a cell is code**. In command mode, press *m* to make a cell markdown and *y* to make it code
- Press *a* in command mode to create a new cell *above* the current one
- Press *b* in command mode to create a new cell *below* the current one
- *Tab* autocompletes methods (like in IPython)
- *Shift+Tab* shows you the docstring for the outer function of the line your cursor is in
- Press *dd* in command mode to delete a cell. 
- *Cmd+z* undoes operations in the highlighted cell, *z* undoes cell operations in the notebook (e.g. deleting a cell)

In [1]:
# Example
print("Welcome to Google Developer Student Clubs!")

Welcome to Google Developer Student Clubs!


# Part 1: General Overview of python (15 mins)
In this part we go through the basics we need to know before loading data. We start with printing and arithmetic, move on to conditions and loops, and finish with functions and classes.<br>
**Note: this covers only the very basics needed to follow along in this workshop. If you want to go deeper, a dedicated Python book is a good next step.**

In [3]:
# To print your name in python
print("Abdoulaye Balde")
9+10

Abdoulaye Balde


19

In [4]:
# Arithmetic operators 
a = 4
b = 10
result = a + b
print("The result after adding {} and {} is {}".format(a, b, result))

The result after adding 4 and 10 is 14


As in the cell above, we can use the other arithmetic operators: subtraction (`-`), multiplication (`*`) and division (`/`). As an exercise, apply these three operators using different variables assigned to different numbers.

In [5]:
# Python Conditions if else
age = 10

if age > 10:
    print("Greater than 10")
elif age == 10:
    print("Age equal to 10")
else:
    print("Less than 10")

Age equal to 10


In [6]:
# Looping in Python
## Using a for loop: we want to print the numbers from 0 to 9
for i in range(10):
    print("The value of i at {} iteration is {}".format(i+1, i))


The value of i at 1 iteration is 0
The value of i at 2 iteration is 1
The value of i at 3 iteration is 2
The value of i at 4 iteration is 3
The value of i at 5 iteration is 4
The value of i at 6 iteration is 5
The value of i at 7 iteration is 6
The value of i at 8 iteration is 7
The value of i at 9 iteration is 8
The value of i at 10 iteration is 9


In [7]:
# we can get the same thing using while loop
i = 0
while i < 10:
    print("The value of i at {} iteration is {}".format(i+1, i))
    i+=1
    

The value of i at 1 iteration is 0
The value of i at 2 iteration is 1
The value of i at 3 iteration is 2
The value of i at 4 iteration is 3
The value of i at 5 iteration is 4
The value of i at 6 iteration is 5
The value of i at 7 iteration is 6
The value of i at 8 iteration is 7
The value of i at 9 iteration is 8
The value of i at 10 iteration is 9


**Note: we can also use `break` or `continue` to control our loops.**

In [10]:
# Print the first 5 numbers, then stop the loop early with break
i = 1
while i < 10:
    print(i)
    if i == 5:
        break
    i += 1

1
2
3
4
5


In [11]:
# this example is with continue. This will skip 5 while printing the output
i = 0
while i < 10:
    i += 1
    if i == 5:
        continue
    print(i)

1
2
3
4
6
7
8
9
10


#### List, tuple and Dictionary

In [None]:
# List
a = []   # an empty list
print(a)

print("\n\n") # leave two lines blank
b = [1,2,3,4]  # list of numbers
print(b)
print("\n\n")
fruits = ['Orange', 'Banana', "Apple"] # list of fruits
print(fruits)

print("\n\n")
# you can mix list using different datatypes
mylist = [1, "Pineapple", 3.14, [2,3,4]]
print(mylist)

In [None]:
#Tuple
t = ()  # an empty tuple
print(t)


mytuple1 = (1,2,3,4,5)
print(mytuple1)


In [None]:
## Dictionary
dic = {}  # an empty dictionary

mydic = {'username': 'abdoulayegk', 'online': True, 'followers': 987}
print(mydic)

In [None]:
mydic['username']

In [None]:
print(mydic['followers'])
# Returns 987

print(mydic['online'])
# Returns True

In [None]:
# To print the key-value pairs
for key, value in mydic.items():
    print(key, 'is the key for the value', value)

Using Methods to Access Elements <br>
In addition to using keys to access values, we can also work with some built-in methods:<br>

dict.keys() isolates keys <br>
dict.values() isolates values<br>
dict.items() returns items in a list format of (key, value) tuple pairs<br>

In [None]:
# To return all the keys of our dictionary
print(mydic.keys())


In [None]:
# To return the values of a dictionary
print(mydic.values())

In [None]:
# defining function in python

def greetings(name):
    return "Hello "+ name


print(greetings("Abdoulaye"))

# Another example: finding the factorial of a number

# def factorial(n:int)->int:
#     if n == 0:
#         return 1
#     else:
#         return n*factorial(n-1)


# if __name__ == '__main__':
#     print(factorial(5))

In [None]:
# class in python

class Greetings:
    def __init__(self, name, age):
        self.name = name
        self.age = age
        
    def greet_someone(self):
        print("Your name is {} and you are {}". format(self.name, self.age))

        
greet = Greetings("Abdoulaye", 23)
greet.greet_someone()

# Part 2: Loading and inspecting the data (15 mins)

Before we can start answering questions about the data, we need to do a little bit of exploratory analysis. The first thing to do when working with a new dataset is to get an idea of what the data looks like. We start by loading the data into memory. Pandas comes with a built-in `read_csv` function that we can use to read CSV files and load them directly into a pandas `DataFrame` object. 
- **Note the dataset is on my github account** [dataset](https://raw.githubusercontent.com/abdoulayegk/ml-workshop/main/kidney_disease.csv)

In [None]:
# We need to import the libraries to start with
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# This command makes charts show inline in a notebook
%matplotlib inline

# Making the figures show up a little larger than default size
plt.rcParams['figure.figsize'] = [10,6]

## What is a dataframe?
* A **dataframe** is a **2-dimensional labeled data structure** with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used Pandas object. 
* Pandas borrows the concept of DataFrame from the statistical programming language R.
* There are a lot of **different ways to read data** into a dataframe - from lists, dicts, CSVs, databases... In this example, we're loading data from a CSV file!
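
To make the "dict of Series" picture concrete, here is a small sketch that builds a dataframe by hand. The column names mirror a few columns of the workshop dataset, but the values are made up for illustration.

```python
import pandas as pd

# Hypothetical values, only to illustrate the structure
data = {
    "age": [48, 62, 51],
    "bp": [80.0, 70.0, 80.0],
    "pc": ["normal", "abnormal", "normal"],
}
toy_df = pd.DataFrame(data)

# Each column of a dataframe is a Series with its own data type
age_column = toy_df["age"]
print(type(age_column))
print(toy_df.dtypes)

# A list of dicts (one dict per row) works as well
rows = [{"age": 48, "pc": "normal"}, {"age": 62, "pc": "abnormal"}]
toy_df_from_rows = pd.DataFrame(rows)
print(toy_df_from_rows)
```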

**Let's take a look at the data to familiarize ourselves with the format and data types. In this example, we're using the chronic kidney disease dataset, which contains clinical measurements for each patient and a class label.**

In [None]:
# Read data from a CSV into a dataframe
# This is the data we're going to be working with!
df = pd.read_csv("kidney_disease.csv")
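
`read_csv` can also read directly from a URL, and it takes optional arguments that control how values are parsed. As a self-contained sketch, the snippet below reads a tiny made-up CSV from a string (standing in for a real file) and uses the `na_values` argument to treat `?` as missing. The CSV content and the name `toy_df` are assumptions for illustration only.

```python
import io

import pandas as pd

# A tiny made-up CSV standing in for a real file
csv_text = "id,age,rc\n0,48,5.2\n1,62,?\n2,,4.6\n"

# na_values tells read_csv which extra strings to treat as missing
toy_df = pd.read_csv(io.StringIO(csv_text), na_values=["?"])

print(toy_df)
print(toy_df.dtypes)
```

Because the `?` is converted to NaN while reading, the `rc` column is loaded as a numeric column rather than as text.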


In [None]:
# Just typing the name of the dataframe will print the entire output
# If there are too many rows, Jupyter will print the top few and 
# bottom few rows with a "..." to indicate that there are more rows
df

# Data Set Information:

The dataset uses the following column names:
1. age - age
2. bp - blood pressure
3. sg - specific gravity
4. al - albumin
5. su - sugar
6. rbc - red blood cells
7. pc - pus cell
8. pcc - pus cell clumps
9. ba - bacteria
10. bgr - blood glucose random
11. bu - blood urea
12. sc - serum creatinine
13. sod - sodium
14. pot - potassium
15. hemo - hemoglobin
16. pcv - packed cell volume
17. wc - white blood cell count
18. rc - red blood cell count
19. htn - hypertension
20. dm - diabetes mellitus
21. cad - coronary artery disease
22. appet - appetite
23. pe - pedal edema
24. ane - anemia
25. classification - class (`ckd` or `notckd`)
26. id

## Inspecting a dataframe using built-in functions
* Most operations on a dataframe happen by applying a function to it using the "." notation, e.g. `my_dataframe.do_something()`
* Let's look at some simple functions that we can apply to Pandas dataframes

In [None]:
# Let's see the column names of our dataset
df.columns.to_list()

In [None]:
# The head(n) function shows the first n rows in a dataframe.
# If no n is specified, it defaults to 5 rows.
df.head()

In [None]:
# You can also use the sample() function to get n random rows in 
# the dataframe
df.sample(5)

In [None]:
# The info() function prints some basic information about the dataframe
# such as the number of columns and rows
# Let's talk about the # column later!
df.info()

In [None]:
# the variable classification is our target so let's rename it.
df = df.rename(columns = {"classification":'target'})

In [None]:
# The describe function shows some basic statistics for numeric columns
# The .T transposes the result so that each column of df becomes a row
df.describe().T

In [None]:
# now let's see the shape of our dataset
df.shape

## Other ways to inspect a dataframe
* There are other operations you can do on a dataframe that don't follow the function notation
* Let's look at a few examples:
1. `len(df)`
2. `df.dtypes`, etc.
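
A short sketch of these operations on a made-up dataframe (the name `toy_df` and its values are assumptions for illustration):

```python
import pandas as pd

toy_df = pd.DataFrame({
    "age": [48, 62, 51],
    "pc": ["normal", "abnormal", "normal"],
})

# len() is a built-in Python function and returns the number of rows
n_rows = len(toy_df)

# dtypes and shape are attributes, so they are written without parentheses
column_types = toy_df.dtypes
n_rows_and_cols = toy_df.shape

print(n_rows)
print(column_types)
print(n_rows_and_cols)
```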


## <span style="color:blue">*** DIY exercise ***</span>
Create a new cell below and print the first ten rows of the "df" dataframe.

## Accessing columns in a dataframe

In [None]:
# Return the id column as a Series
df['id'].head()

In [None]:
# Check the type to show that this indeed returns a Series object
type(df['wc'])

In [None]:
# And this is how you access two columns of a dataframe.
# Note that this will return a dataframe again, not a series 
# (because a series has only one column...)
# Also note the double square brackets 
# because you're passing a *list* of columns as an argument
df[['wc', 'pcc']].head()

In [None]:
# This way we can now do some more data exploration, 
# e.g. looking at the unique values of the target column using the
# unique function, which returns an array of values
df['target'].unique()

## <span style="color:blue">*** DIY exercise ***</span>
Create a new cell below and print the unique values of the `pc` column in the dataframe.

In [None]:
df.pc.unique()

## Accessing rows in a dataframe
In addition to slicing by column, we often want to get the records where a column has a specific value, e.g. a specific age here. This can be done using the `.loc` indexer and a boolean condition:

In [None]:
# Access the record(s) where the value in the age column is 20
df.loc[df['age'] == 20]

In [None]:
# This is equivalent to the following shorter notation
# I prefer to always use loc to be more explicit
df[df['age'] == 20]

In [None]:
# You can also use boolean conditions in the selector
df.loc[(df['age'] == 20) & (df['pc'] == 'normal')]
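
Besides `&` (and), conditions can be combined with `|` (or) and negated with `~`. The sketch below uses a made-up dataframe so it can be run on its own; each condition needs its own parentheses.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "age": [20, 20, 45, 60],
    "pc": ["normal", "abnormal", "normal", "abnormal"],
})

# | means "or": rows where age is 20 OR pc is abnormal
either = toy_df.loc[(toy_df["age"] == 20) | (toy_df["pc"] == "abnormal")]

# ~ negates a condition: rows where age is NOT 20
not_twenty = toy_df.loc[~(toy_df["age"] == 20)]

# isin checks membership in a list of values
selected_ages = toy_df.loc[toy_df["age"].isin([20, 60])]

print(either)
print(not_twenty)
print(selected_ages)
```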

## Sorting dataframes
Sorting the output of a dataframe can be helpful for visually inspecting or presenting data! Sorting by one or multiple columns is super easy using the `sort_values` function:

In [None]:
# Sort by age, youngest first, i.e. in ascending order (default)
df.sort_values('age').head()

**Note: you can use `ascending=False` to sort in descending order, and you can also sort by several columns at once.**
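
A small sketch of both options on a made-up dataframe (values are hypothetical):

```python
import pandas as pd

toy_df = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "age": [48, 7, 62, 48],
    "bp": [80, 50, 80, 70],
})

# Oldest patients first
oldest_first = toy_df.sort_values("age", ascending=False)

# Sort by age (descending) and break ties by bp (ascending)
by_age_then_bp = toy_df.sort_values(["age", "bp"], ascending=[False, True])

print(oldest_first)
print(by_age_then_bp)
```

Without `inplace=True`, `sort_values` returns a new sorted dataframe and leaves `toy_df` itself in its original order.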


In [None]:
# Use the inplace keyword to modify the dataframe
# Note that you can also sort by a list of columns
df.sort_values(['id', 'age'], inplace=True)

In [None]:
df.target.unique()

# Part 3: Data cleaning (10 mins)

In [None]:
# We can use replace to map the class labels to numbers
df['target'].replace({'ckd':1, 'notckd':0})

In this case the change is temporary: `replace` returns a new Series, and because we didn't store the result, the original dataframe is unchanged.

In [None]:
# Use inplace=True to make the change in the original dataset
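
A minimal sketch of this behaviour, using a made-up Series instead of the workshop dataframe. Assigning the result back to a variable (or to a dataframe column) is a common alternative to `inplace=True`.

```python
import pandas as pd

labels = pd.Series(["ckd", "notckd", "ckd"])

# replace returns a new Series; labels itself is not modified
encoded = labels.replace({"ckd": 1, "notckd": 0})

print(labels.tolist())
print(encoded.tolist())
```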

In [None]:
# Check whether there are any missing values in the dataset
ax = df.isna().sum().sort_values().plot(kind='barh', figsize=(9, 10))
plt.title('Percentage of Missing Values Per Column', fontdict={'size': 15})
for p in ax.patches:
    # The width of each bar is the number of missing values in that column
    percentage = '{:,.0f}%'.format((p.get_width() / df.shape[0]) * 100)
    width, height = p.get_width(), p.get_height()
    x = p.get_x() + width + 0.02
    y = p.get_y() + height / 2
    ax.annotate(percentage, (x, y))
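
If you only need the numbers rather than a chart, the percentage of missing values per column can be computed directly. This is a sketch on a made-up dataframe; the name `toy_df` and its values are assumptions.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "age": [48, np.nan, 62, 51],
    "bp": [80, 70, np.nan, np.nan],
    "id": [0, 1, 2, 3],
})

# isna() gives True/False per cell; the mean of booleans is the
# fraction of True values, so multiplying by 100 gives a percentage
missing_pct = toy_df.isna().mean() * 100

print(missing_pct.sort_values(ascending=False))
```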

In [None]:
sns.catplot(x="target", kind="count", data= df)

In [None]:
df.rc.unique()

In [None]:
df.pcv.unique()

In [None]:
df.wc.unique()

In [None]:
df.sg.unique()


Our `rc` column has the object data type. It also has some missing values and some placeholder entries such as `?` whose meaning we don't know. We are going to replace these placeholders with NaN first, and later fill all missing values with the column mean.

In [None]:
# To replace the '?' placeholder strings with NaN
# (assigning the result back is more reliable than inplace=True on a single column)
df['rc'] = df['rc'].replace('?', np.nan)
df['wc'] = df['wc'].replace('?', np.nan)
df['pcv'] = df['pcv'].replace('?', np.nan)

In [None]:
df.rbc.unique()

In [None]:
# Encoding categorical variables
# pd.get_dummies returns one indicator column per category, so we keep only
# the first indicator column to get a single 0/1 column for each variable.
# Note that missing values end up encoded as 0 here.
categorical_cols = ['target', 'rbc', 'pc', 'pcc', 'ba', 'htn',
                    'dm', 'cad', 'appet', 'pe', 'ane']
for col in categorical_cols:
    df[col] = pd.get_dummies(df[col]).iloc[:, 0].astype(int)
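
For a column with exactly two labels, an explicit mapping is another way to get a 0/1 encoding, and it lets you choose which label becomes 1. The sketch below assumes the labels are `yes` and `no`, which is an assumption about the data and should be checked with `unique()` first. The dataframe and the `htn_encoded` column name are made up for illustration.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({"htn": ["yes", "no", np.nan, "yes"]})

# map looks up each value in the dict; missing values stay missing
toy_df["htn_encoded"] = toy_df["htn"].map({"yes": 1, "no": 0})

print(toy_df)
```

Unlike the indicator-column approach, missing values are kept as NaN here, so they can still be handled in the fill step later.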


In [None]:
df.dtypes

#### These columns are still of type object, so we have to convert them to numeric:
1. pcv        
2. wc         
3. rc         

In [None]:
# now we have to change the datatype of pcv, wc and rc
df.wc = df.wc.astype('float64')
df.rc = df.rc.astype('float64')
df.pcv = df.pcv.astype('float64')
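
Note that `astype('float64')` raises an error if any value in the column cannot be parsed as a number. As an alternative sketch, `pd.to_numeric` with `errors="coerce"` converts what it can and turns everything else into NaN in one step. The Series below is made up for illustration.

```python
import pandas as pd

raw = pd.Series(["5.2", "?", "4.6", None])

# Unparseable entries such as "?" become NaN instead of raising an error
numeric = pd.to_numeric(raw, errors="coerce")

print(numeric)
```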

In [None]:
# Now let's fill missing values
# Continuous measurements: fill with the column mean.
# We assign the result back instead of using inplace=True on a single column,
# which newer pandas versions warn about.
mean_cols = ['age', 'bp', 'sg', 'wc', 'pcv', 'rc', 'pot', 'bu',
             'sod', 'hemo', 'sc', 'bgr']
for col in mean_cols:
    df[col] = df[col].fillna(df[col].mean())

# al and su are treated here as discrete grades, so we fill them with the
# most frequent value. mode() returns a Series, so we take its first element.
for col in ['al', 'su']:
    df[col] = df[col].fillna(df[col].mode()[0])
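
One detail worth knowing when filling with the mode: `mode()` returns a Series, not a single number, and `fillna` matches a Series by index. The sketch below, on a made-up Series, shows the difference between passing the Series and passing its first element.

```python
import numpy as np
import pandas as pd

su = pd.Series([np.nan, 0.0, 0.0, np.nan])

# mode() returns a Series with index 0, so only the row at index 0 is filled
filled_with_series = su.fillna(su.mode())

# Taking the first element gives a single value that fills every gap
filled_with_scalar = su.fillna(su.mode()[0])

print(filled_with_series)
print(filled_with_scalar)
```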


In [None]:
df.isna().sum()