**Tutorial 1. Data Aggregation: Summarising Data with Mean, Median, Mode, Standard Deviation, Variance, Quantiles, and Percentiles**

***1.1. Mean, Median, Mode, Standard Deviation, Max, Min in Pandas DataFrame***

In [None]:
import pandas as pd
from IPython.display import display

# Read the CSV file from the directory
diabities_df = pd.read_csv("/data/chapter1/diabetes.csv")

print(f'\n Mean \n \n {diabities_df.mean()}')

print(f'\n Median \n \n {diabities_df.median()}')

print(f'\n Mode \n \n {diabities_df.mode()}')

print(f'\n Standard Deviation \n \n {diabities_df.std()}')

print(f'\n Variance \n \n {diabities_df.var()}')

print(f'\n Maximum \n \n {diabities_df.max()}')

print(f'\n Minimum \n \n {diabities_df.min()}')

***1.2. Mean, Median, Mode, Standard Deviation, Max, Min in Numpy Array***

In [None]:
import numpy as np
import statistics as st

# Create a numpy array
data = np.array([12, 15, 20, 25, 30, 30, 35, 40, 45, 50])

# Mean
mean = np.mean(data)

# Median
median = np.median(data)

# Mode
mode_result = st.mode(data)
mode_result

# Standard Deviation (np.std gives the population value, ddof=0, by default)
std_dev = np.std(data)

# Maximum
maximum = np.max(data)

# Minimum
minimum = np.min(data)

print("Mean:", mean)
print("Median:", median)
print("Mode:", mode_result)
print("Standard Deviation:", std_dev)
print("Maximum:", maximum)
print("Minimum:", minimum)

***1.3. Variance, Quantiles, and Percentiles are computed using `var()` and `quantile()`; `describe()` also shows the quartiles***

In [None]:
import pandas as pd
from IPython.display import display

# Read the CSV file from the directory
diabities_df = pd.read_csv("/workspaces/ImplementingStatisticsWithPython/data/chapter1/diabetes.csv")

# Variance
variance = diabities_df.var()

# Quantiles (25th, 50th, and 75th percentiles)
quantiles = diabities_df.quantile([0.25, 0.5, 0.75])

# Percentiles (90th and 95th percentiles)
percentiles = diabities_df.quantile([0.9, 0.95])

display("Variance:", variance)
display("Quantiles:", quantiles)
display("Percentiles:", percentiles)
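The heading above mentions `describe()`, which the cell does not call. A minimal sketch of what it reports follows, assuming a small made-up DataFrame (the column values are hypothetical and not taken from diabetes.csv). Note that `describe()` reports the standard deviation rather than the variance.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
sample_df = pd.DataFrame({"Glucose": [85, 90, 100, 110, 120],
                          "Age": [21, 25, 30, 35, 40]})

# describe() returns count, mean, std, min, 25%, 50%, 75%, and max per column
summary = sample_df.describe()
print(summary)

# Other percentiles can be requested explicitly (the median is always included)
summary_with_tails = sample_df.describe(percentiles=[0.9, 0.95])
print(summary_with_tails)
```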

**Tutorial 2. Data Normalisation, Standardization, Transformation**

***2.1. Data Normalization on a Numpy array***

In [None]:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Structured data (2D array)
structured_data = np.array([[1, 2], [3, 4], [5, 6]])
scaler = MinMaxScaler()
normalized_structured = scaler.fit_transform(structured_data)

print("Normalized Structured Data:")
print(normalized_structured)
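To make clear what `MinMaxScaler` computes, here is a minimal sketch that applies the min-max formula `(x - min) / (max - min)` by hand to the same array, column by column, and compares the result with the scaler's output. The variable names `col_min`, `col_max`, `manual`, and `scaled` are introduced only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

structured_data = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)

# Min-max normalization by hand: (x - min) / (max - min), computed per column
col_min = structured_data.min(axis=0)
col_max = structured_data.max(axis=0)
manual = (structured_data - col_min) / (col_max - col_min)

# The same data through MinMaxScaler, for comparison
scaled = MinMaxScaler().fit_transform(structured_data)

print(manual)
print("Matches MinMaxScaler:", np.allclose(manual, scaled))
```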

***2.2. Data Normalization on Pandas DataFrame***

In [None]:
import pandas as pd
from IPython.display import display
from sklearn.preprocessing import MinMaxScaler

# Read the CSV file from the directory
diabities_df = pd.read_csv("/workspaces/ImplementingStatisticsWithPython/data/chapter1/diabetes.csv")

# Specify columns to normalize
columns_to_normalize = ['Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI','DiabetesPedigreeFunction','Age','Outcome']

scaler = MinMaxScaler()
diabities_df[columns_to_normalize] = scaler.fit_transform(diabities_df[columns_to_normalize])

print("Normalized Structured Data:")
diabities_df
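Tutorial 2 also names standardization, which the cells above do not show. A minimal sketch, assuming scikit-learn's `StandardScaler` and a small made-up array: each column is shifted to mean 0 and scaled to unit standard deviation, that is, z = (x - mean) / std.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical two-column data, used only for illustration
values = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

standardized = StandardScaler().fit_transform(values)

# The same computation by hand; StandardScaler uses the population std (ddof=0)
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(standardized)
print("Column means:", standardized.mean(axis=0))
print("Column stds:", standardized.std(axis=0))
```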

In unstructured data such as text, normalization may involve natural language processing steps such as converting to lowercase, removing punctuation, 
and handling special characters such as extra whitespace.
<br>
This is shown in ***Tutorial 2.3.***
In image or audio data, it may involve rescaling pixel values or extracting features.

***2.3. Converting to lowercase, removing punctuation, and handling special characters such as whitespace in unstructured text data***

In [None]:
import re

def normalize_text(text):
    # Convert to lowercase
    text = text.lower()
    
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

# Sample unstructured text data
unstructured_text = "This is an a text for book Implementing Stat with Python, with! various punctuation marks..."

normalized_text = normalize_text(unstructured_text)
print("Original Text:", unstructured_text)
print("Normalized Text:", normalized_text)

**Tutorial 3. Data Binning, Grouping and Encoding**
<br>
Data binning summarizes and preprocesses data. It helps with handling noisy data, statistical inference, pattern detection, and analysis.
<br>
It groups a set of continuous data points into discrete intervals, or bins. 

***3.1. Data Binning in pandas DataFrame***

In [None]:
import pandas as pd
from IPython.display import display

# Read the CSV file from the directory
diabities_df = pd.read_csv("/workspaces/ImplementingStatisticsWithPython/data/chapter1/diabetes.csv")

# Define the bin intervals
bin_edges = [0, 30, 60, 100]

# Use cut to create a new column with bin labels
diabities_df['Age_Group'] = pd.cut(diabities_df['Age'], bins=bin_edges, labels=['<30', '30-60', '60-100'])

# Count the number of people in each age group
age_group_counts = diabities_df['Age_Group'].value_counts().sort_index()

# View the DataFrame with the new bin (category) column
diabities_df

***3.2. Data Binning in Numpy Array***

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Create a sample NumPy array of exam scores
scores = np.array([75, 82, 95, 68, 90, 85, 78, 72, 88, 93, 60, 72, 80])

# Define the bin intervals
bin_edges = [0, 60, 70, 80, 90, 100]

# Use histogram to count the number of scores in each bin
bin_counts, _ = np.histogram(scores, bins=bin_edges)

# Plot a histogram of the binned scores
plt.bar(range(len(bin_counts)), bin_counts, align='center')
plt.xticks(range(len(bin_edges) - 1), ['<60', '60-69', '70-79', '80-89', '90+'])
plt.xlabel('Score Range')
plt.ylabel('Number of Scores')
plt.title('Distribution of Exam Scores')
plt.show()

In text files, data binning means grouping and categorizing text data based on some criterion. To perform it:
1. Determine a criterion for binning. For example: the sentence count, word count, sentiment score, or topic of a text.
2. Read each text and calculate the chosen criterion. For example: count the number of words in each text.
3. Define bins based on ranges of values of the chosen criterion. For example: define short, medium, and long based on the word count of a text.
4. Assign each text file to the appropriate bin based on its calculated value.
5. Analyse or summarize the data in the new bins.

Some use cases of binning in text files:
a. Grouping text files based on their length.
b. Binning based on the sentiment analysis score.
c. Topic binning by performing topic modelling.
d. Language binning if text files are in different languages.
e. Time-based binning if text files have timestamps. 
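The five steps listed above can be sketched without any files on disk. This is a minimal illustration, assuming word count as the criterion; the sample texts, bin edges, and labels are hypothetical and chosen only for the example.

```python
import pandas as pd

# Hypothetical in-memory texts standing in for text files
texts = {
    "note_a": "Paid rent",
    "note_b": "Transferred money to savings account for the upcoming holiday",
    "note_c": ("Bought groceries and fuel on the way home from work today "
               "because the fridge was empty and the tank was low"),
}

# Steps 1-2: the criterion is word count; compute it for every text
word_counts = {name: len(text.split()) for name, text in texts.items()}

# Step 3: define bins on the word count (edges chosen arbitrarily for this sketch)
bin_edges = [0, 5, 15, 100]
bin_labels = ["Short", "Medium", "Long"]

# Step 4: assign each text to a bin
binned = pd.DataFrame({"text": list(word_counts),
                       "count": list(word_counts.values())})
binned["bin"] = pd.cut(binned["count"], bins=bin_edges, labels=bin_labels)

# Step 5: summarize the bins
print(binned)
print(binned["bin"].value_counts().sort_index())
```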

***3.3. Data Binning in Text file collection using word count***

In [None]:
import os
import glob
import pandas as pd
path = "/workspaces/ImplementingStatisticsWithPython/data/chapter1/TransactionNarrative"
files = glob.glob(path + "/*.txt")

# Function that takes a file name as an argument and returns the word count of that file
def word_count(file):
    # Open the file in read mode
    with open(file, "r") as f:
        # Read the file content
        content = f.read()
        # Split the content by whitespace characters
        words = content.split()
        # Return the length of the words list
        return len(words)

counts = [word_count(file) for file in files]

binning_df = pd.DataFrame({"file": files, "count": counts})
# Bin the word counts into three labelled ranges: (0, 26], (26, 30], (30, 35]
binning_df["bin"] = pd.cut(binning_df["count"], bins=[0, 26, 30, 35], labels=["Short", "Medium", "Long"])
binning_df

In unstructured data, data binning can be used for text categorization and modelling of text data, color quantization and feature extraction on image data,
and audio segmentation and feature extraction on audio data. 
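As an illustration of binning on image data, the following is a minimal sketch of color quantization on a grayscale image: each pixel intensity is assigned to one of four equal-width bins and replaced by the midpoint of its bin. The 4x4 image, the bin edges, and the midpoints are assumptions made for this example.

```python
import numpy as np

# Hypothetical 8-bit grayscale image (values 0-255)
image = np.array([[0, 40, 90, 130],
                  [60, 100, 170, 200],
                  [30, 128, 220, 255],
                  [10, 64, 192, 250]], dtype=np.uint8)

# Four equal-width bins: 0-63, 64-127, 128-191, 192-255
bin_edges = np.array([64, 128, 192])
bin_index = np.digitize(image, bin_edges)  # bin number (0 to 3) for every pixel

# Replace each pixel with the midpoint of its bin
bin_midpoints = np.array([32, 96, 160, 224], dtype=np.uint8)
quantized = bin_midpoints[bin_index]

print(bin_index)
print(quantized)
```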


***3.3. Grouping***

***3.3.1. Grouping of the DataFrame based on the condition and binning***

In [None]:
import pandas as pd

# Create a DataFrame with student data
data = {'Name': ['John', 'Anna', 'Peter', 'Carol', 'David', 'Oystein','Hari'],
        'Age': [15, 16, 17, 15, 16, 14, 16],
        'Score': [85, 92, 78, 80, 88, 77, 89]}
df = pd.DataFrame(data)

# Group the data based on age intervals (e.g., 14-16, 17-18, etc.)
age_intervals = pd.cut(df['Age'], bins=[13, 16, 18])
# observed=False keeps intervals with no rows in the result
grouped_data = df.groupby(age_intervals, observed=False)['Score'].mean()

print(grouped_data)

***3.3.2. Scikit-learn digit datasets can be grouped based on relevant criteria such as the target labels***

In [None]:
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt

# Class to display and perform grouping of digits
class Digits_Grouping:
    # Constructor method to initialize the object's attributes
    def __init__(self,digits):
        self.digits = digits

    def display_digit_image(self):
        # Get the images and labels from the dataset
        images = self.digits.images
        labels = self.digits.target
        
        # Display the first few images along with their labels
        num_images_to_display = 5  # You can change this number as needed

        # Plot the selected images side by side in subplots
        plt.figure(figsize=(10, 4))
        for i in range(num_images_to_display):
            plt.subplot(1, num_images_to_display, i + 1)
            plt.imshow(images[i], cmap='gray')
            plt.title(f"Label: {labels[i]}")
            plt.axis('off')
        # Display the plot
        plt.show()

    def display_label_based_grouping(self):
        # Group the data based on target labels
        grouped_data = {}
        # Iterate through each image and its corresponding target in the dataset.
        for image, target in zip(self.digits.images, self.digits.target):
            # Check if the current target value is not already present as a key in grouped_data.
            if target not in grouped_data:
                # If the target is not in grouped_data, add it as a new key with an empty list as the value.
                grouped_data[target] = []
            
            # Append the current image to the list associated with the target key in grouped_data.
            grouped_data[target].append(image)

        # Print the number of samples in each group
        for target, images in grouped_data.items():
            print(f"Target {target}: {len(images)} samples")

displayDigit = Digits_Grouping(load_digits())
displayDigit.display_digit_image()
displayDigit.display_label_based_grouping()
                    


***3.4. Encoding***

***3.4.1. One-hot encoding tutorial with method get_dummies()***

One-hot encoding creates a new column for each distinct value of the categorical variable, and the presence or absence of that value in each row is denoted by a binary value of 1 or 0. It encodes categorical data in a way that machine learning algorithms can understand and interpret. However, it increases the dimensionality of the data and produces sparse matrices.

In [None]:
# Import pandas library
import pandas as pd

# Create a sample dataframe with 3 columns: name, gender and color
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Ane', 'Bo', 'Lee', 'Dam', 'Eva'],
    'gender': ['F', 'M', 'M', 'M', 'F', 'F', 'M', 'M', 'M', 'F'],
    'color': ['red', 'blue', 'green', 'yellow', 'pink', 'red', 'blue', 'green', 'yellow', 'pink']
})

# Print the original dataframe
print(df)

# Apply one hot encoding on the gender and color columns using pandas.get_dummies()
df_encoded = pd.get_dummies(df, columns=['gender', 'color'], dtype=int)

# Print the encoded dataframe
df_encoded


***3.4.2. One-hot encoding on the complete dataframe with object column***
<br>
Dataset used:  
Becker, Barry and Kohavi, Ronny. (1996). Adult. UCI Machine Learning Repository. https://doi.org/10.24432/C5XW20.
Accessed from: https://archive.ics.uci.edu/dataset/2/adult 
Used under the Creative Commons Attribution 4.0 International license (CC BY 4.0).

In [None]:
import pandas as pd
from IPython.display import display

# Read the CSV file from the directory
diabities_df = pd.read_csv("/workspaces/ImplementingStatisticsWithPython/data/chapter2/Adult_UCI/adult.data")

def one_hot_encoding(diabities_df):
    # Identify columns that are categorical to apply one hot encoding in them only
    columns_for_one_hot = diabities_df.select_dtypes(include="object").columns
    
    # Apply one hot encoding to the categorical columns
    diabities_df = pd.get_dummies(diabities_df, columns=columns_for_one_hot, prefix=columns_for_one_hot, dtype=int)
    
    # Display the transformed dataframe
    display(diabities_df.head(5))

one_hot_encoding(diabities_df)

***3.4.3. Binary encoding tutorial***
<br>
For this, let's use the `category_encoders` library, installed with `pip install category_encoders`.

Binary encoding first assigns an integer value to each distinct category in the categorical variable and then converts that integer value into a binary code. Unlike one-hot encoding, which adds a new column for each distinct category, binary encoding minimizes the number of columns needed to describe categorical data. However, binary encoding has drawbacks, including adding ordinality or hierarchy to categories that may not already exist and making interpretation and analysis more difficult.

In [None]:
# Import pandas library and category_encoders library
import pandas as pd
import category_encoders as ce

# Create a sample dataframe with 3 columns: name, gender and color
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Ane', 'Bo', 'Lee', 'Dam', 'Eva'],
    'gender': ['F', 'M', 'M', 'M', 'F', 'F', 'M', 'M', 'M', 'F'],
    'color': ['red', 'blue', 'green', 'yellow', 'pink', 'red', 'blue', 'green', 'yellow', 'pink']
})

# Print the original dataframe
print(df)

# Create a binary encoder object
encoder = ce.BinaryEncoder(cols=['name', 'gender', 'color'])

# Fit and transform the dataframe using the encoder
df_encoded = encoder.fit_transform(df)

# Print the encoded dataframe
print(df_encoded)
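To show the mechanism that binary encoding relies on, here is a minimal hand-written sketch using only pandas. It is not the `category_encoders` implementation, and its integer numbering (starting at 1, in order of first appearance) and column names are assumptions of this sketch, so the library's output may differ in detail.

```python
import math
import pandas as pd

colors = pd.Series(['red', 'blue', 'green', 'yellow', 'pink', 'red'], name='color')

# Step 1: assign an integer to each distinct category, in order of first appearance
# (starting at 1 here; this numbering is an assumption of the sketch)
codes, categories = pd.factorize(colors)
codes = codes + 1

# Step 2: number of binary columns needed to represent the largest code
n_bits = max(1, math.ceil(math.log2(codes.max() + 1)))

# Step 3: write each integer in binary, one bit per column (most significant first)
bits = {f'color_{i}': (codes >> (n_bits - 1 - i)) & 1 for i in range(n_bits)}
encoded = pd.DataFrame(bits)
encoded.insert(0, 'color', colors)

print(encoded)
print(f"{len(categories)} categories fit in {n_bits} binary columns")
```

With five distinct colors, one-hot encoding would need five columns, while this sketch needs three.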


***Difference between one-hot encoding and binary encoding:***
<br> One-hot encoding creates a new column for each possible value of the categorical variable and assigns a 1 or 0 to indicate whether that value is present. 
For example, the color column in the data frame can be one-hot encoded with `pd.get_dummies()`.

Binary encoding converts each possible value of the categorical variable into a binary code, and then splits the code into separate columns, one per bit. 
For example, the color column in the data frame can be binary encoded with `ce.BinaryEncoder`.

In [None]:
from IPython.display import display
import pandas as pd
import category_encoders as ce

class Encoders_Difference:
    def __init__(self,df):
        self.df = df

    def one_hot_encoding(self):
        # One new column per distinct color value
        df_encoded1 = pd.get_dummies(self.df, columns=['color'], dtype=int)
        display(df_encoded1)

    def binary_encoder(self):
        # A few columns holding the bits of each color's integer code
        encoder = ce.BinaryEncoder(cols=['color'])
        df_encoded2 = encoder.fit_transform(self.df)
        display(df_encoded2)
        

# Create a sample dataframe with 3 columns: name, gender and color
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Ane', 'Bo', 'Lee', 'Dam', 'Eva'],
    'gender': ['F', 'M', 'M', 'M', 'F', 'F', 'M', 'M', 'M', 'F'],
    'color': ['red', 'blue', 'green', 'yellow', 'pink', 'red', 'blue', 'green', 'yellow', 'pink']
})

encoderDifference_obj = Encoders_Difference(df)
encoderDifference_obj.one_hot_encoding()
encoderDifference_obj.binary_encoder()


***3.4.4. Label encoding tutorial***

A label encoder assigns an integer value to each unique category in the categorical variable, starting from 0, so the transformed variable holds numerical values instead of categorical values. Its drawback is that the integers carry no information about how similar or different the categories are, and they can suggest an order that the categories do not have.

In [None]:
import pandas as pd
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Ane', 'Bo', 'Lee', 'Dam', 'Eva'],
    'gender': ['F', 'M', 'M', 'M', 'F', 'F', 'M', 'M', 'M', 'F'],
    'color': ['red', 'blue', 'green', 'yellow', 'pink', 'red', 'blue', 'green', 'yellow', 'pink']
})

# cat.codes is an accessor attribute that returns the integer category codes of a categorical variable
df['gender_label'] = df['gender'].astype('category').cat.codes
df['color_label'] = df['color'].astype('category').cat.codes
print(df)
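The cell above uses `cat.codes`. As an alternative, here is a minimal sketch with scikit-learn's `LabelEncoder`, which sorts the categories and uses each category's position as its label. scikit-learn documents this class as intended for target labels, so using it on a feature column here is purely illustrative. The short `colors` list is made up for the example.

```python
from sklearn.preprocessing import LabelEncoder

colors = ['red', 'blue', 'green', 'yellow', 'pink', 'red']

encoder = LabelEncoder()
color_labels = encoder.fit_transform(colors)

# classes_ holds the categories in sorted order; the position is the label
print(list(encoder.classes_))
print(list(color_labels))

# inverse_transform maps the integers back to the original categories
restored = list(encoder.inverse_transform(color_labels))
print(restored)
```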

**Tutorial 4. Missing Data, Detecting and Treating Outliers**

**Missing Data and how to handle missing data**

Data values that are not stored or captured for some variables or observations in a dataset are referred to as missing data. It may happen for a number of reasons, including human mistakes, equipment malfunctions, data entry challenges, privacy concerns, or flaws with survey design. The accuracy and reliability of the analysis and inference can be impacted by missing data. 
In structured data, identifying missing values is fairly easy, whereas in semi-structured and unstructured data it may not always be the case.

Ways to handle missing data
<br>
1. Deletion
2. Imputation
3. Prediction of missing value
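Before choosing one of these approaches, it helps to see where the gaps are. A minimal sketch, assuming a small made-up DataFrame (`df_missing` is a hypothetical name), that counts missing values per column with `isna()` and shows row-wise deletion with `dropna()`:

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing entries
df_missing = pd.DataFrame({'Name': ['John', 'Anna', 'Peter'],
                           'Age': [15, np.nan, 17],
                           'Score': [85, 92, np.nan]})

# Count missing values per column
missing_per_column = df_missing.isna().sum()
print(missing_per_column)

# Deletion: drop every row that contains at least one missing value
rows_dropped = df_missing.dropna()
print(rows_dropped)
```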

**4.1. Deleting (dropping) columns from a DataFrame**

In [None]:
import pandas as pd
from IPython.display import display

# Read the CSV file from the directory
diabities_df = pd.read_csv("/workspaces/ImplementingStatisticsWithPython/data/chapter2/Adult_UCI/adult.data")

# Drop the columns that are not needed for the analysis
diabities_df = diabities_df.drop(columns=[' Work', ' person_id', ' education', ' education_number',
       ' marital_status'])

# Verify the updated DataFrame
diabities_df.head(6)

**4.2. Imputation**

Basic imputation can be done with `mean()`, `median()`, `mode()`, or a constant value; in some cases the missing value can be predicted and the predicted value imputed.

4.2.1 Imputation of mean value with `mean()`

In [None]:
import numpy as np
import pandas as pd
from IPython.display import display

# Create a DataFrame with student data
data = {'Name': ['John', 'Anna', 'Peter', 'Carol', 'David', 'Oystein','Hari', 'Suresh','Ram'],
        'Age': [15, 16, np.nan, 15, 16, 14, 16, 30, 31],
        'Score': [85, 92, 78, 80, np.nan, 77, 89, 99, 76]}
student_DF = pd.DataFrame(data)
print('DataFrame before mean imputation')
display(student_DF)

mean_age = student_DF['Age'].mean()
mean_score = student_DF['Score'].mean()
print(f'DataFrame after mean imputation')
student_DF = student_DF.fillna(value= {'Age' : mean_age, 'Score': mean_score})
display(student_DF)

4.2.2. Imputation by median value. Here we use `median()`; `SimpleImputer` from scikit-learn (`from sklearn.impute import SimpleImputer`) can also be used.

In [None]:
import numpy as np
import pandas as pd
from IPython.display import display

# Create a DataFrame with student data
data = {'Name': ['John', 'Anna', 'Peter', 'Carol', 'David', 'Oystein','Hari', 'Suresh','Ram'],
        'Age': [15, 16, np.nan, 15, 16, 14, 16, 30, 31],
        'Score': [85, 92, 78, 80, np.nan, 77, 89, 99, 76]}
student_DF = pd.DataFrame(data)
print('DataFrame before median imputation')
display(student_DF)

median_age = student_DF['Age'].median()
median_score = student_DF['Score'].median()
print(f'DataFrame After Median Imputation')
student_DF = student_DF.fillna(value= {'Age' : median_age, 'Score': median_score})
display(student_DF)
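The `SimpleImputer` route mentioned in 4.2.2 can be sketched on the same student data. This is a minimal illustration rather than a recommended pipeline; only the numeric columns are passed to the imputer, since the median is not defined for names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

data = {'Name': ['John', 'Anna', 'Peter', 'Carol', 'David', 'Oystein', 'Hari', 'Suresh', 'Ram'],
        'Age': [15, 16, np.nan, 15, 16, 14, 16, 30, 31],
        'Score': [85, 92, 78, 80, np.nan, 77, 89, 99, 76]}
student_DF = pd.DataFrame(data)

# strategy can be 'mean', 'median', 'most_frequent', or 'constant'
imputer = SimpleImputer(strategy='median')

# Fit on the numeric columns and write the imputed values back
student_DF[['Age', 'Score']] = imputer.fit_transform(student_DF[['Age', 'Score']])
print(student_DF)
```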

**4.3. Prediction of missing value**
Missing values can be estimated and predicted based on other available information in the dataset. If estimates are not done properly it can introduce noise and uncertainty in the data. The missing values (missingness) can also be used as a variable, indicating whether a value was missing or not, if appropriate. But doing so can increase dimensionality. More on this will be discussed in Chapter 7.
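Using missingness as a variable can be sketched as follows. This is a minimal example on made-up data; the `_missing` column suffix and the median fill are choices made only for this illustration. Each indicator column adds one dimension, which is the cost noted above.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap per numeric column
df_scores = pd.DataFrame({'Age': [15, np.nan, 17, 16],
                          'Score': [85, 92, np.nan, 80]})

# Record which values were missing before they are filled in
for column in ['Age', 'Score']:
    df_scores[f'{column}_missing'] = df_scores[column].isna().astype(int)

# Fill the gaps afterwards (the median is used here purely as an example)
medians = df_scores[['Age', 'Score']].median()
df_scores[['Age', 'Score']] = df_scores[['Age', 'Score']].fillna(medians)
print(df_scores)
```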

Some general guidelines:
1. If the missing data is randomly distributed across the dataset and not too many (less than 5% of the total observations), then a simple method such as replacing the missing values with the mean, median, or mode of the corresponding variable may be sufficient. This method is implemented by the SimpleImputer class in scikit-learn.
2. If the missing data are not randomly distributed or are too many (more than 5% of the total observations), a simple method may introduce bias and reduce the variability of the data. In this case, a more sophisticated method that takes into account the relationship between variables may be preferable. For example, you can use a regression model to predict the missing values based on other variables, or a nearest neighbor approach to find the most similar observation and use its value as an imputation.
3. If the missing data are longitudinal, that is, they occur in repeated measurements over time, then a method that accounts for the temporal structure of the data may be more appropriate. For example, one can use a time series model to predict the missing values based on past and future observations, or a mixed effects model to account for both fixed and random effects over time.
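Guidelines 2 and 3 can be sketched on made-up data. The first part assumes scikit-learn's `KNNImputer` as the nearest neighbor approach. The second part uses linear interpolation from pandas as a simple stand-in for a time series model, which is an assumption of this sketch and not a full temporal model.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Guideline 2: nearest-neighbour imputation on hypothetical data
features = np.array([[1.0, 2.0],
                     [2.0, np.nan],
                     [3.0, 6.0],
                     [4.0, 8.0]])

# The gap is filled with the average of the 2 most similar rows
knn_imputed = KNNImputer(n_neighbors=2).fit_transform(features)
print(knn_imputed)

# Guideline 3: linear interpolation over a hypothetical daily series
readings = pd.Series([10.0, np.nan, 14.0, np.nan, 18.0],
                     index=pd.date_range('2024-01-01', periods=5, freq='D'))

# Each gap is filled from the observations before and after it
interpolated = readings.interpolate(method='linear')
print(interpolated)
```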

**Tutorial 5. Histograms, Box plots, Scatter plots, Pie Charts, Bar Charts, X-Y Plots, Heatmaps**

Before plotting, first identify which kind of plot is appropriate for the data.
To learn which plot suits which kind of data, you can check these websites: 
1. https://www.data-to-viz.com/
2. https://datavizproject.com/


***5.1. Histograms***

In [None]:
import matplotlib.pyplot as plt

# Create your data
data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5]

# Create the histogram
plt.hist(data)

# Customize the plot (add labels, title, etc.)
plt.xlabel('X-axis Label')
plt.ylabel('Y-axis Label')
plt.title('Histogram Title')

# Show the plot
plt.show()

***5.2. Pie chart***

In [None]:
import matplotlib.pyplot as plt

sizes = [15, 30, 45, 10]  
labels = ['Category A', 'Category B', 'Category C', 'Category D']  
colors = ['blue', 'green', 'red', 'purple']  

# Create the pie chart
plt.pie(sizes, labels=labels, colors=colors)

# Customize the plot (add title, aspect ratio, etc.)
plt.axis('equal')  
plt.title('Pie Chart Title')

# Show the plot
plt.show()

***5.3. Bar plot***

In [None]:
import matplotlib.pyplot as plt

# Create your data
categories = ['Category A', 'Category B', 'Category C', 'Category D']
values = [10, 25, 15, 30]

# Create the bar chart
plt.bar(categories, values)

# Customize the plot (add labels, title, etc.)
plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Bar Chart Title')

# Show the plot
plt.show()

***5.4. Line plot***

In [None]:
import matplotlib.pyplot as plt

# Create your data
x_data = [1, 2, 3, 4, 5]
y_data = [10, 15, 13, 20, 18]

# Create the line plot
plt.plot(x_data, y_data)

# Customize the plot (add labels, title, etc.)
plt.xlabel('X-axis Label')
plt.ylabel('Y-axis Label')
plt.title('Line Plot Title')

# Show the plot
plt.show()

***5.5. Scatter plot***

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Create a sample DataFrame (replace this with your own DataFrame)
data = {'X': [1, 2, 3, 4, 5],
        'Y': [10, 15, 13, 20, 18]}
df = pd.DataFrame(data)

# Extract the columns you want to plot
x_data = df['X']
y_data = df['Y']

# Create a scatter plot; the label is what the legend displays
plt.scatter(x_data, y_data, marker='o', label='Y vs X')

# Add labels and a title
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Scatter Plot Example')

# Show the legend
plt.legend()

# Display the plot
plt.show()