## I. Pandas and Python Tips and Tricks for Data Science and Data Analysis
(https://towardsdatascience.com/pandas-and-python-tips-and-tricks-for-data-science-and-data-analysis-1b1e05b7d93a)

### 1. Create a new column from multiple columns in your dataframe

In [5]:
import pandas as pd

# Create the dataframe
candidates= {
    'Name':["Aida","Mamadou","Ismael","Aicha","Fatou", "Khalil"],
    'Degree':['Master','Master','Bachelor', "PhD", "Master", "PhD"],
    'From':["Abidjan","Dakar","Bamako", "Abidjan","Konakry", "Lomé"],
    'Years_exp': [2, 3, 0, 5, 4, 3],
    'From_office(min)': [120, 95, 75, 80, 100, 34]
          }
candidates_df = pd.DataFrame(candidates)

"""
----------------My custom function-------------------
""" 
def candidate_info(row):

  # Select columns of interest 
  name = row.Name 
  is_from = row.From
  year_exp = row.Years_exp
  degree = row.Degree
  from_office = row["From_office(min)"]

  # Generate the description from previous variables
  info = f"""{name} from {is_from} holds a {degree} degree 
              with {year_exp} year(s) experience 
              and lives {from_office} from the office"""

  return info

"""
-------Application of the function to the data ------
"""
candidates_df["Description"] = candidates_df.apply(candidate_info, axis=1)
candidates_df

Unnamed: 0,Name,Degree,From,Years_exp,From_office(min),Description
0,Aida,Master,Abidjan,2,120,Aida from Abidjan holds a Master degree \n ...
1,Mamadou,Master,Dakar,3,95,Mamadou from Dakar holds a Master degree \n ...
2,Ismael,Bachelor,Bamako,0,75,Ismael from Bamako holds a Bachelor degree \n ...
3,Aicha,PhD,Abidjan,5,80,Aicha from Abidjan holds a PhD degree \n ...
4,Fatou,Master,Konakry,4,100,Fatou from Konakry holds a Master degree \n ...
5,Khalil,PhD,Lomé,3,34,Khalil from Lomé holds a PhD degree \n ...


### 2. Convert numerical data into categories

In [4]:
seniority = ['Entry level', 'Mid level', 'Senior level']
seniority_bins = [0, 1, 3, 5]
candidates_df['Seniority'] = pd.cut(candidates_df['Years_exp'],
                                    bins=seniority_bins, 
                                    labels=seniority, 
                                    include_lowest=True)

candidates_df

Unnamed: 0,Name,Degree,From,Years_exp,From_office(min),Description,Seniority
0,Aida,Master,Abidjan,2,120,Aida from Abidjan holds a Master degree \n ...,Mid level
1,Mamadou,Master,Dakar,3,95,Mamadou from Dakar holds a Master degree \n ...,Mid level
2,Ismael,Bachelor,Bamako,0,75,Ismael from Bamako holds a Bachelor degree \n ...,Entry level
3,Aicha,PhD,Abidjan,5,80,Aicha from Abidjan holds a PhD degree \n ...,Senior level
4,Fatou,Master,Konakry,4,100,Fatou from Konakry holds a Master degree \n ...,Senior level
5,Khalil,PhD,Lomé,3,34,Khalil from Lomé holds a PhD degree \n ...,Mid level


### 3. Select rows from a Pandas Dataframe based on column(s) values

In [7]:
# Get all the candidates with a Master degree
ms_candidates = candidates_df.query("Degree == 'Master'")

# Get non bachelor candidates
no_bs_candidates = candidates_df.query("Degree != 'Bachelor'")

# Get values from list
list_locations = ["Abidjan", "Dakar"]
selected_candidates = candidates_df.query("From in @list_locations")
selected_candidates

Unnamed: 0,Name,Degree,From,Years_exp,From_office(min),Description
0,Aida,Master,Abidjan,2,120,Aida from Abidjan holds a Master degree \n ...
1,Mamadou,Master,Dakar,3,95,Mamadou from Dakar holds a Master degree \n ...
3,Aicha,PhD,Abidjan,5,80,Aicha from Abidjan holds a PhD degree \n ...


### 4. Deal with zip files

In [None]:
import pandas as pd

"""
------------ READ ZIP FILES -----------
"""
# Case 1: read a single zip file 
candidate_df_unzip = pd.read_csv('candidates.csv.zip', compression='zip')

# Case 2: read a file from a folder
from zipfile import ZipFile

# Read the file from a zip folder
sales_df = pd.read_csv(ZipFile("data.zip").open('data/sales_df.csv'))


"""
------------ WRITE ZIP FILES -----------
"""
# Read data from internet
url = "https://raw.githubusercontent.com/keitazoumana/Fastapi-tutorial/master/data/spam.csv"
spam_data = pd.read_csv(url, encoding="ISO-8859-1")

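# Save it as a plain CSV file and as a zip file
spam_data.to_csv("spam.csv", index=False)
spam_data.to_csv("spam.csv.zip", compression="zip", index=False)

# Check the file sizes (ratio of uncompressed to compressed)
from os import path
path.getsize('spam.csv') / path.getsize('spam.csv.zip')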

### 5. Select a subset of your Pandas dataframe with specific column types

In [None]:
# Import pandas library
import pandas as pd

# Read my dataset
candidates_df = pd.read_csv("./data/candidates_data.csv")

# Check the data columns' types
candidates_df.dtypes

# Only select columns of type "object" & "datetime"
candidates_df.select_dtypes(include = ["object", "datetime64"])

# Exclude columns of type "datetime" & "int"
candidates_df.select_dtypes(exclude = ["int64", "datetime64"])

### 6. Remove comments from Pandas dataframe column

In [None]:
import pandas as pd

# Read my messy dataset
messy_df = pd.read_csv("./data/candidates_data.csv")

# FIRST SCENARIO -> REMOVE COMMENTS
clean_df = pd.read_csv("./data/candidates_data.csv", comment='#')

# SECOND SCENARIO -> CREATE NEW COLUMN FOR COMMENTS
# n must be passed by keyword in recent pandas versions
messy_df[['application_date', 'comment']] = messy_df['application_date'].str.split('#', n=1, expand=True)
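
Since `./data/candidates_data.csv` is not available here, the two scenarios above can be sketched on a small inline CSV. The rows and comment texts below are made up for illustration, and `io.StringIO` simply stands in for the real file.

```python
import io
import pandas as pd

# Inline CSV standing in for the messy file (made-up rows and comments)
csv_text = """Name,application_date
Aida,11/17/2022 # applied online
Mamadou,09/23/2022
Ismael,12/2/2021 # referred by a colleague
"""

# First scenario: everything after '#' is dropped while reading
clean_df = pd.read_csv(io.StringIO(csv_text), comment="#")

# Second scenario: keep the comments, but move them to their own column
messy_df = pd.read_csv(io.StringIO(csv_text))
messy_df[["application_date", "comment"]] = (
    messy_df["application_date"].str.split("#", n=1, expand=True)
)

print(clean_df)
print(messy_df)
```

Rows without a `#` simply end up with a missing value in the new `comment` column.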

### 7. Print Pandas dataframe in tabular format from the console

In [11]:
# Import pandas library
import pandas as pd

data_URL = "https://raw.githubusercontent.com/keitazoumana/Experimentation-Data/main/vgsales.csv" 

# Read your dataframe
video_game_data = pd.read_csv(data_URL)

"""
Printing without to_string() function
"""
print(video_game_data.head())

"""
Printing with to_string() function
"""
print(video_game_data.head().to_string())

   Rank                      Name Platform    Year         Genre Publisher  \
0     1                Wii Sports      Wii  2006.0        Sports  Nintendo   
1     2         Super Mario Bros.      NES  1985.0      Platform  Nintendo   
2     3            Mario Kart Wii      Wii  2008.0        Racing  Nintendo   
3     4         Wii Sports Resort      Wii  2009.0        Sports  Nintendo   
4     5  Pokemon Red/Pokemon Blue       GB  1996.0  Role-Playing  Nintendo   

   NA_Sales  EU_Sales  JP_Sales  Other_Sales  Global_Sales  
0     41.49     29.02      3.77         8.46         82.74  
1     29.08      3.58      6.81         0.77         40.24  
2     15.85     12.88      3.79         3.31         35.82  
3     15.75     11.01      3.28         2.96         33.00  
4     11.27      8.89     10.22         1.00         31.37  
   Rank                      Name Platform    Year         Genre Publisher  NA_Sales  EU_Sales  JP_Sales  Other_Sales  Global_Sales
0     1                Wii Sports

### 8. Highlight data points in Pandas

In [12]:
import pandas as pd

my_info = {
    "Salary": [100000.2, 95000.9, 103000.2, 65984.1, 150987.08], 
    "Height": [6.5, 5.2, 5.59, 6.7, 6.92], 
    "weight": [185.23, 105.12, 110.3, 190.12, 200.59]      
}
my_data = pd.DataFrame(my_info)

"""
Function to highlight min and max
"""

def highlight_min_max(data_frame, min_color, max_color):

  # This first line creates a Styler object
  final_data = data_frame.style.highlight_max(color = max_color)

  # On this second line, no need to use ".style"
  final_data = final_data.highlight_min(color = min_color)

  return final_data
  
# Function to apply ORANGE to min and GREEN to max
highlight_min_max(my_data, min_color='orange', max_color='green')


"""
Custom function: apply RED to values below the column mean and GREEN to the others.
"""
def highlight_values(data_column):
  # style.apply passes one column at a time by default (axis=0)
  low_value_color = "background-color: #C4606B; color: white;"
  high_value_color = "background-color: #C4DE6B; color: white;"
  is_low = data_column < data_column.mean()

  return [low_value_color if low_value else high_value_color for low_value in is_low]
  
# Application of my custom function to only 'Height' & 'weight'
my_data.style.apply(highlight_values, subset=['Height', 'weight'])

Unnamed: 0,Salary,Height,weight
0,100000.2,6.5,185.23
1,95000.9,5.2,105.12
2,103000.2,5.59,110.3
3,65984.1,6.7,190.12
4,150987.08,6.92,200.59


### 9. Reduce decimal points in your data

In [3]:
long_decimals_info = {
    "Salary": [100000.23400000, 95000.900300, 103000.2300535, 65984.14000450, 150987.080345], 
    "Height": [6.501050, 5.270000, 5.5900001050, 6.730001050, 6.92100050], 
    "weight": [185.23000059, 105.1200099, 110.350003, 190.12000000, 200.59000000]      
}

long_decimals_df = pd.DataFrame(long_decimals_info)

"""
Format the data with 2 decimal places
"""
fewer_decimals_df = long_decimals_df.round(decimals=2)
fewer_decimals_df

Unnamed: 0,Salary,Height,weight
0,100000.23,6.5,185.23
1,95000.9,5.27,105.12
2,103000.23,5.59,110.35
3,65984.14,6.73,190.12
4,150987.08,6.92,200.59


### 10. Replace some values in your data frame

In [14]:
import pandas as pd
import numpy as np

candidates_info = {
    'Full_Name':["Aida Kone","Mamadou Diop","Ismael Camara","Aicha Konate",
                 "Fanta Koumare", "Khalil Cisse"],
    'degree':['Master','MS','Bachelor', "PhD", "Masters", np.nan],
    'From':[np.nan,"Dakar","Bamako", "Abidjan","Konakry", "Lomé"],
    'Age':[23,26,19, np.nan,25, np.nan],
          }

candidates_df = pd.DataFrame(candidates_info) 

"""
Replace Masters, Master by MS
"""
degrees_to_replace = ["Master", "Masters"]
candidates_df.replace(to_replace = degrees_to_replace, value = "MS", inplace=True)

"""
Replace all the NaN by "Missing"
"""
candidates_df.replace(to_replace=np.nan, value = "Missing", inplace=True)
candidates_df

Unnamed: 0,Full_Name,degree,From,Age
0,Aida Kone,MS,Missing,23
1,Mamadou Diop,MS,Dakar,26
2,Ismael Camara,Bachelor,Bamako,19
3,Aicha Konate,PhD,Abidjan,Missing
4,Fanta Koumare,MS,Konakry,25
5,Khalil Cisse,Missing,Lomé,Missing


### 11. Compare two data frames and get their differences

In [None]:
import pandas as pd
from pandas.testing import assert_frame_equal

candidates_df = pd.read_csv("data/candidates.csv")

"""
Create a second dataframe by changing "Full_Name" & "Age" columns
"""
candidates_df_test = candidates_df.copy()
candidates_df_test.loc[0, 'Full_Name'] = 'Aida Traore'
candidates_df_test.loc[2, 'Age'] = 28

"""
Compare the two dataframes: candidates_df & candidates_df_test
"""
# 1. Comparison showing only the values that differ
candidates_df.compare(candidates_df_test)

# 2. Same comparison, but equal values are shown instead of being masked as NaN
candidates_df.compare(candidates_df_test, keep_equal=True)
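
Because `data/candidates.csv` is not shipped with these notes, here is a self-contained sketch of the same comparison. It assumes a tiny dataframe built from the names and ages used in tip 10, so the exact shape of the output is only illustrative.

```python
import pandas as pd

# Small stand-in for data/candidates.csv (values borrowed from tip 10)
candidates_df = pd.DataFrame({
    "Full_Name": ["Aida Kone", "Mamadou Diop", "Ismael Camara"],
    "Age": [23, 26, 19],
})

# Second dataframe with two modified cells
candidates_df_test = candidates_df.copy()
candidates_df_test.loc[0, "Full_Name"] = "Aida Traore"
candidates_df_test.loc[2, "Age"] = 28

# Only rows and columns with at least one difference are kept;
# "self" is the original dataframe, "other" is the modified one
diff = candidates_df.compare(candidates_df_test)
print(diff)
```

The untouched row (index 1) does not appear in `diff` at all, which is what makes `compare()` handy for spotting changes quickly.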

### 12. Get a subset of a very large dataset for quick analysis

In [None]:
# Pandas library
import pandas as pd 

# Load execution time
%load_ext autotime

# File to get sample from (size: 261.6 MB)
large_data = "diabetes_benchmark_data.csv"

# Sample size of interest
sample_size = 400

"""
Approach n°1: Read all the data in memory before getting the sample 
"""
read_whole_data = pd.read_csv(large_data)
sample_data = read_whole_data.head(sample_size)

"""
Approach n°2: Read the sample on the fly
"""
read_sample = pd.read_csv(large_data, nrows=sample_size)

### 13. Transform your data frame from a wide to a long format

In [16]:
import pandas as pd

# My experimentation data
candidates= {
    'Name':["Aida","Mamadou","Ismael","Aicha"],
    'ID': [1, 2, 3, 4],
    '2017':[85, 87, 89, 91],
    '2018':[96, 98, 100, 102],
    '2019':[100, 102, 106, 106],
    '2020':[89, 95, 98, 100],
    '2021':[94, 96, 98, 100],
    '2022':[100, 104, 104, 107],
          }
"""
Data in wide format
"""
salary_data = pd.DataFrame(candidates)

"""
Transformation into the long format
"""
long_format_data = salary_data.melt(id_vars=['Name', 'ID'], 
                                    var_name='Year', value_name='Salary(k$)')
long_format_data

Unnamed: 0,Name,ID,Year,Salary(k$)
0,Aida,1,2017,85
1,Mamadou,2,2017,87
2,Ismael,3,2017,89
3,Aicha,4,2017,91
4,Aida,1,2018,96
5,Mamadou,2,2018,98
6,Ismael,3,2018,100
7,Aicha,4,2018,102
8,Aida,1,2019,100
9,Mamadou,2,2019,102


### 14. Reduce the size of your Pandas data frame by ignoring the index

In [None]:
import pandas as pd

# Read data from Github
URL = "https://raw.githubusercontent.com/keitazoumana/Experimentation-Data/main/diabetes.csv"
data = pd.read_csv(URL)

# Create large data by repeating each row 10000 times
large_data = data.loc[data.index.repeat(10000)]

"""
SAVE WITH INDEX
"""
large_data.to_csv("large_data_with_index.csv")

# Check the size of the file 
!ls -GFlash large_data_with_index.csv

"""
SAVE WITHOUT INDEX
"""
large_data.to_csv("large_data_without_index.csv", index = False)

# Check the size of the file 
!ls -GFlash large_data_without_index.csv     

### 15. Parquet instead of CSV

In [None]:
import pandas as pd

# Read data from Github
URL = "https://raw.githubusercontent.com/keitazoumana/Experimentation-Data/main/diabetes.csv"
data = pd.read_csv(URL)

# Create large data for experimentation by repeating each row 20,000 times
exp_data = data.loc[data.index.repeat(20000)]

"""
EXPERIMENT WITH .CSV FORMAT
"""
# Write Time (%time is the line magic; %%time only works as the first line of a cell)
%time exp_data.to_csv("exp_data.csv", index=False)

# Read Time
%time csv_data = pd.read_csv("exp_data.csv")

# File Size
!ls -GFlash exp_data.csv

"""
EXPERIMENT WITH .PARQUET FORMAT
"""
# Write Time
%time exp_data.to_parquet('exp_data.parquet')

# Read Time
%time parquet_data = pd.read_parquet('exp_data.parquet')

# File Size
!ls -GFlash exp_data.parquet

### 16. Transform your data frame into a markdown

In [None]:
# .to_markdown() function
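
A minimal sketch of how `DataFrame.to_markdown()` might be used, on a small dataframe assembled for illustration. pandas relies on the optional `tabulate` package for this, so the call is guarded here in case it is not installed.

```python
import pandas as pd

# Small illustrative dataframe (subset of the candidates used earlier)
candidates_df = pd.DataFrame({
    "Name": ["Aida", "Mamadou", "Ismael"],
    "Degree": ["Master", "Master", "Bachelor"],
    "Years_exp": [2, 3, 0],
})

# to_markdown() needs the optional "tabulate" dependency
try:
    markdown_table = candidates_df.to_markdown(index=False)
except ImportError:
    markdown_table = None  # tabulate is not installed

print(markdown_table)
```

The resulting string can be pasted directly into a README or a notebook markdown cell.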

### 17. Format Date Time column

In [17]:
# parse_dates
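
As a rough illustration of the `parse_dates` argument of `pd.read_csv`, the sketch below reads the same inline CSV twice. The CSV content is made up, and `io.StringIO` stands in for a real file.

```python
import io
import pandas as pd

# Inline CSV standing in for a real file (made-up data)
csv_text = """Name,Application_date
Aida,2022-11-17
Mamadou,2022-09-23
Ismael,2021-12-02
"""

# Without parse_dates, the date column is read as plain strings
raw_df = pd.read_csv(io.StringIO(csv_text))

# With parse_dates, the column is converted to datetime while reading
parsed_df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Application_date"])

print(raw_df.dtypes)
print(parsed_df.dtypes)
```

Once the column is a datetime, the `.dt` accessor (day, month, year, and so on) becomes available without an extra `pd.to_datetime` call.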

## II. Python tips and tricks
### 1. Create a progress bar with tqdm and rich

In [None]:
#!pip -q install rich
from rich.progress import track
from tqdm import tqdm
import time

In [None]:
def compute_double(x):
  return 2*x

In [None]:
# rich progress bar implementation
final_dict_doubles = {}

for i in track(range(20), description="Computing 2.n..."):
  final_dict_doubles[f"Value = {i}"] = f"double = {compute_double(i)}"

  # Sleep the process to highlight the progress
  time.sleep(0.8)

In [None]:
# tqdm progress bar implementation
for i in tqdm(range(20), desc="Computing 2.n..."):
  final_dict_doubles[f"Value = {i}"] = f"double = {compute_double(i)}"

  # Sleep the process to highlight the progress
  time.sleep(1)

### 2. Get day, month, year, day of the week, the month of the year

In [None]:
candidates= {
    'Name':["Aida","Mamadou","Ismael","Aicha","Fatou", "Khalil"],
    'Degree':['Master','Master','Bachelor', "PhD", "Master", "PhD"],
    'From':["Abidjan","Dakar","Bamako", "Abidjan","Konakry", "Lomé"],
    'Application_date': ['11/17/2022', '09/23/2022', '12/2/2021', 
                         '08/25/2022', '01/07/2022', '12/26/2022']
          }
candidates_df = pd.DataFrame(candidates)
candidates_df['Application_date'] = pd.to_datetime(candidates_df["Application_date"])

# GET the Values
application_date = candidates_df["Application_date"]

candidates_df["Day"] = application_date.dt.day 
candidates_df["Month"] = application_date.dt.month 
candidates_df["Year"] = application_date.dt.year 
candidates_df["Day_of_week"] = application_date.dt.day_name()
candidates_df["Month_of_year"] = application_date.dt.month_name()

### 3. Smallest and largest values of a column

In [None]:
# df.nlargest(N, "Col_Name")  → top N rows based on Col_Name
# df.nsmallest(N, "Col_Name") → N smallest rows based on Col_Name
# Col_Name is the name of the column you are interested in.
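
A short sketch of both methods, reusing the names and years of experience from the candidates example at the top of these notes. `N` is set to 2 arbitrarily.

```python
import pandas as pd

candidates_df = pd.DataFrame({
    "Name": ["Aida", "Mamadou", "Ismael", "Aicha", "Fatou", "Khalil"],
    "Years_exp": [2, 3, 0, 5, 4, 3],
})

# The 2 candidates with the most experience
top_2 = candidates_df.nlargest(2, "Years_exp")

# The 2 candidates with the least experience
bottom_2 = candidates_df.nsmallest(2, "Years_exp")

print(top_2)
print(bottom_2)
```

Both calls return full rows (not just the column values), sorted by the chosen column.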

### 4. Ignore the log output of the pip install command

You can specify the `-q` or `--quiet` option to get rid of that information, for example `pip install -q rich`.

### 5. Run multiple commands in a single notebook cell

The exclamation mark ‘!’ is essential to successfully run a shell command from your Jupyter notebook.

However, this approach can be quite repetitive 🔂 when dealing with multiple commands or a very long and complicated one.

✅ A better way to tackle this issue is to use the `%%bash` cell magic at the beginning of your notebook cell.

### 6. Virtual environment.

A Data Science project can involve multiple dependencies, and dealing with all of them can be a bit annoying. 🤯

✨ A good practice is to organize your project in a way that it can be easily shared with your team members and reproduced with the least amount of effort.

✅ One way of doing this is to use virtual environments.

⚙️ **Create a virtual environment and install libraries.**

→ Install the virtual environment module.
`pip install virtualenv`

→ Create your environment by giving it a meaningful name.
`virtualenv [your_environment_name]`

→ Activate your environment.
`source [your_environment_name]/bin/activate`

→ Start installing the dependencies for your project.
`pip install pandas`
…

All this is great 👏🏼, BUT… the virtual environment you just created is local to your machine 😏.

**What to do?** 🤷🏻‍♂️

💡 You need to permanently save those dependencies in order to share them with others using this command:

→ `pip freeze > requirements.txt`

This will create a `requirements.txt` file containing your project dependencies.

🔚 Finally, anyone can install the exact same dependencies by running this command:
→ `pip install -r requirements.txt`

### 7. Run multiple metrics at once

In [None]:
"""
Individual imports
"""
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

print("Precision: ", precision_score(y_true, y_pred, average='macro'))
print("Recall: ", recall_score(y_true, y_pred, average='macro'))
print("F1 Score: ", f1_score(y_true, y_pred, average='macro')) 


"""
Single Line import
"""
from sklearn.metrics import precision_recall_fscore_support 

# Store the result in f1 so the imported f1_score function is not overwritten
precision, recall, f1, _ = precision_recall_fscore_support(y_true, 
                                                           y_pred, 
                                                           average='macro')
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")

### 8. Chain multiple lists as a single sequence

You can use a single for loop to iterate through multiple lists as a single sequence 🔂.

✅ This can be achieved using the `chain()` ⛓ function from Python's `itertools` module.
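
A minimal sketch of the idea, with three short lists invented for illustration:

```python
from itertools import chain

# Three separate lists (illustrative values)
names = ["Aida", "Mamadou"]
degrees = ["Master", "PhD"]
years_exp = [2, 3]

# A single loop walks through all three lists, one after the other
merged = []
for item in chain(names, degrees, years_exp):
    merged.append(item)

print(merged)
```

`chain()` does not build an intermediate combined list; it yields the items lazily, which helps when the inputs are large.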

### 9. Pretty print of JSON data

❓ Have you ever wanted to print your JSON data with proper indentation for better readability?

✅ The `indent` parameter of the `json.dumps()` function can be used to specify the indentation level of your formatted string output.
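
For instance, a sketch with a small dictionary made up for illustration, comparing the default output with an indentation of 4 spaces:

```python
import json

# Illustrative record (made-up values)
candidate = {"Name": "Aida", "Degree": "Master", "Skills": ["Python", "Pandas"]}

# Default: everything on a single line
compact = json.dumps(candidate)

# indent=4: one key per line, nested items indented by 4 spaces per level
pretty = json.dumps(candidate, indent=4)

print(compact)
print(pretty)
```

The data itself is unchanged; only the whitespace of the serialized string differs.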