
# Python Example: Analyzing a Dataset Relevant to Your Field

This section walks through a practical data analysis in Python, aimed at data scientists working in the retail sector. It covers loading the data, performing exploratory data analysis (EDA), building a simple predictive model, and interpreting the results. The goal is to analyze customer purchase data to identify patterns and predict future buying behavior, using Pandas for data manipulation and Scikit-learn for modeling.

1. Loading and Inspecting the Data:

We'll start by loading a dataset of customer purchases from a retail store. The dataset includes customer demographics and purchase history; the code in this section uses the columns `customer_id`, `age`, `last_purchase`, and `will_purchase_again`. We will use Pandas, a data manipulation library for Python, to load and inspect the dataset.

In [None]:
import pandas as pd

# Load the dataset
data = pd.read_csv('customer_purchases.csv')

# Display the first few rows of the dataframe
print(data.head())

# Summary statistics (count, mean, quartiles, ...) for the numeric columns
print(data.describe())
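
The file `customer_purchases.csv` is not distributed with this section. As a minimal sketch for running the example end to end, the cell below generates a synthetic stand-in. Everything in it is an assumption made for illustration: the category names, the value ranges, the number of rows, and the `yes`/`no` labels. The column names are the ones used elsewhere in this section. The file is only written if it does not already exist, so a real dataset is not overwritten.

```python
from pathlib import Path

import numpy as np
import pandas as pd

# Hypothetical stand-in for customer_purchases.csv; all values are synthetic.
rng = np.random.default_rng(42)
n_customers = 200
categories = ["electronics", "clothing", "groceries"]  # assumed category names

sample = pd.DataFrame({
    "customer_id": np.arange(1, n_customers + 1),
    "age": rng.integers(18, 80, size=n_customers).astype(float),
    "last_purchase": rng.choice(categories, size=n_customers),
    "will_purchase_again": rng.choice(["yes", "no"], size=n_customers),
})

# Blank out a few entries so the cleaning step has something to do.
sample.loc[sample.index[:5], "age"] = np.nan
sample.loc[sample.index[5:8], "last_purchase"] = np.nan

# Write the file only if no dataset with that name is present.
csv_path = Path("customer_purchases.csv")
if not csv_path.exists():
    sample.to_csv(csv_path, index=False)
```

Because the labels here are drawn at random, a model trained on this stand-in should not be expected to predict well; it only exercises the code path.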

2. Data Cleaning:

Data cleaning is a crucial step before any analysis. We'll clean the data by handling missing values, removing duplicates, and converting data types if necessary.

In [None]:
# Check for missing values
print(data.isnull().sum())

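# Fill missing values: median for the numeric column, mode for the categorical one.
# Assigning the result back avoids chained in-place modification, which recent
# pandas versions warn about and which may silently leave `data` unchanged.
data['age'] = data['age'].fillna(data['age'].median())
data['last_purchase'] = data['last_purchase'].fillna(data['last_purchase'].mode()[0])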

# Remove any duplicates
data.drop_duplicates(inplace=True)

# Convert data types
data['customer_id'] = data['customer_id'].astype(int)
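
To make the effect of each cleaning step visible, here is a minimal sketch that wraps the same operations in a function and applies them to a tiny hand-made frame. The function name `clean_customers` and the toy values are hypothetical. Unlike the cell above, this sketch drops duplicates before imputing, so that repeated rows do not influence the median or mode; that ordering is a design choice, not a requirement.

```python
import numpy as np
import pandas as pd


def clean_customers(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of df; the input frame is left untouched."""
    out = df.drop_duplicates().copy()
    # Median imputation for the numeric column (robust to outliers)
    out["age"] = out["age"].fillna(out["age"].median())
    # Mode imputation (most frequent value) for the categorical column
    out["last_purchase"] = out["last_purchase"].fillna(out["last_purchase"].mode()[0])
    out["customer_id"] = out["customer_id"].astype(int)
    return out.reset_index(drop=True)


# Toy data: one exact duplicate row, one missing age, one missing category
toy = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": [25.0, 40.0, 40.0, np.nan, 31.0],
    "last_purchase": ["clothing", "clothing", "clothing", "groceries", None],
})

cleaned = clean_customers(toy)
print(cleaned)
```

After cleaning, the duplicate of customer 2 is gone, the missing age is replaced by the median of the remaining ages (31), and the missing category is replaced by the most frequent one (`clothing`).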

3. Exploratory Data Analysis (EDA):

EDA involves summarizing the main characteristics of the dataset, often with visual methods. We'll use Pandas for data manipulation and Matplotlib for visualization.

In [None]:
import matplotlib.pyplot as plt

# Visualizing the distribution of ages
plt.figure(figsize=(8, 4))
plt.hist(data['age'], bins=30, color='blue', alpha=0.7)
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

# Count how many customers fall into each last-purchase category
purchase_counts = data['last_purchase'].value_counts()
purchase_counts.plot(kind='bar')
plt.title('Purchase Category Frequency')
plt.xlabel('Category')
plt.ylabel('Number of Purchases')
plt.show()
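
Beyond single-column plots, it is often useful to check how a feature relates to the outcome before modeling. The following sketch, on a small hand-made frame, cross-tabulates the last purchase category against `will_purchase_again`. The toy rows and the `yes`/`no` encoding are assumptions for illustration.

```python
import pandas as pd

# Toy data with assumed category names and yes/no labels
toy = pd.DataFrame({
    "last_purchase": ["clothing", "clothing", "groceries",
                      "groceries", "electronics", "clothing"],
    "will_purchase_again": ["yes", "no", "yes", "yes", "no", "yes"],
})

# Row-normalized cross-tabulation: each row sums to 1, so the "yes" column
# is the share of customers in that category who purchase again.
repurchase_rates = pd.crosstab(
    toy["last_purchase"], toy["will_purchase_again"], normalize="index"
)
print(repurchase_rates)
```

On the real dataset, the same call with `data` in place of `toy` would show whether repurchase rates differ noticeably between categories, which hints at whether `last_purchase` is likely to be a useful feature.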

4. Building a Predictive Model:

We'll use a simple logistic regression model to predict whether a customer will purchase again (the binary `will_purchase_again` column) based on their age and last purchase category. Because `last_purchase` is categorical, it is one-hot encoded before training. Scikit-learn is a library that provides simple and efficient tools for data mining and data analysis.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Feature selection
features = data[['age', 'last_purchase']]
target = data['will_purchase_again']  # this is a binary variable (yes/no)

# Encoding categorical variables
features = pd.get_dummies(features)

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

# Create and train the model
model = LogisticRegression(max_iter=1000)  # extra iterations help the solver converge on unscaled features
model.fit(X_train, y_train)

# Predicting the test set results
y_pred = model.predict(X_test)

# Evaluating the model
print("Accuracy:", accuracy_score(y_test, y_pred))
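
Accuracy alone says little without a point of comparison. The sketch below, which runs on synthetic data rather than the real dataset, compares the logistic regression against a majority-class baseline and prints a confusion matrix. The data-generating rule (older customers are more likely to purchase again), the category names, and the `yes`/`no` labels are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic data: repurchase probability rises with age (assumed rule)
rng = np.random.default_rng(0)
n = 400
age = rng.integers(18, 80, size=n)
last_purchase = rng.choice(["clothing", "groceries", "electronics"], size=n)
p_yes = 1 / (1 + np.exp(-(age - 45) / 10))
will_purchase_again = np.where(rng.random(n) < p_yes, "yes", "no")

X = pd.get_dummies(
    pd.DataFrame({"age": age, "last_purchase": last_purchase}), dtype=float
)
y = pd.Series(will_purchase_again)

# stratify=y keeps the yes/no ratio similar in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# Baseline that always predicts the most frequent training label
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

y_pred = model.predict(X_test)
model_acc = accuracy_score(y_test, y_pred)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))

# Rows are true labels, columns are predicted labels, in the order given
cm = confusion_matrix(y_test, y_pred, labels=["no", "yes"])

print("Model accuracy:   ", model_acc)
print("Baseline accuracy:", baseline_acc)
print(cm)
```

If the model's accuracy is not clearly above the baseline's, the features carry little usable signal, however high the raw number looks.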

5. Interpreting the Results:

The final step is to interpret the results of our logistic regression model. Accuracy is the fraction of test-set customers whose label the model predicted correctly. On its own it can mislead when one class dominates, so it is worth comparing against the share of the majority class. The model's coefficients describe how each feature is associated with the log-odds of a repeat purchase: a positive coefficient raises the predicted likelihood, and a negative one lowers it.
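
As a minimal sketch of coefficient inspection, the cell below fits a logistic regression on synthetic data and converts the coefficients to odds ratios. The data-generating rule, category names, and labels are assumptions for illustration; only `coef_` and `classes_` are actual Scikit-learn attributes.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic data: repurchase probability rises with age (assumed rule)
rng = np.random.default_rng(1)
n = 500
age = rng.integers(18, 80, size=n)
last_purchase = rng.choice(["clothing", "groceries", "electronics"], size=n)
p_yes = 1 / (1 + np.exp(-(age - 45) / 10))
y = np.where(rng.random(n) < p_yes, "yes", "no")

X = pd.get_dummies(
    pd.DataFrame({"age": age, "last_purchase": last_purchase}), dtype=float
)
model = LogisticRegression(max_iter=1000).fit(X, y)

# For a binary problem coef_ has shape (1, n_features); the coefficients
# describe the log-odds of model.classes_[1] (here "yes").
coefficients = pd.Series(model.coef_[0], index=X.columns)
odds_ratios = np.exp(coefficients)

print("Positive class:", model.classes_[1])
print(odds_ratios.sort_values(ascending=False))
```

An odds ratio above 1 for `age` means that, with the other features held fixed, each additional year multiplies the odds of a repeat purchase by that factor. Scikit-learn applies L2 regularization by default, so these values are slightly shrunk toward 1 compared with an unregularized fit.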