<a href="https://colab.research.google.com/github/Ucheekemezie/Uchechukwu_Profile/blob/master/AI_Powered_Data_Analysis_%26_Automation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AI-Powered Data Analysis & Automation

## 1. Objective
This report outlines AI-powered insights extracted from a multi-feature dataset using Google AutoML, Power BI and Python. It includes sales forecasting and loan default risk prediction. The report also interprets regression model performance and classification results, guiding business decisions around marketing, credit and risk managment.

## 2. Data Cleaning and Preparation

* **Missing values** in numeric fields such as Income, Loan_Amount, and Credit_Score were replaced using median imputation, which is robust to outliers
* **Outliers** were removed using the Interquartile Range (IQR) method, which filters values lying outside 1.5×IQR from Q1 and Q3.
* **The cleaned dataset** was saved as ‘cleaned_data.csv’ for reproducibility and further analysis.

In [47]:
import pandas as pd
df = pd.read_csv("/content/drive/MyDrive/raw_dataset_week4.csv")
# Select only numeric columns for median imputation
numeric_cols = df.select_dtypes(include=['number']).columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
# Replace missing values with column median
df.to_csv("cleaned_data.csv", index=False)

## 3. Sales Prediction Using Linear Regression
* Features used: ‘Marketing_Spend’ and ‘Seasonality’ (one-hot encoded).
* StandardScaler was applied post-encoding to normalise the feature values.
* The model was trained using an 80/20 train-test split and evaluated using standard regression metrics.

**Evaluation Results:**
* MSE: ₦760,701,209.93
* RMSE: ₦27,580.81

This indicates that the model's sales predictions deviate by approximately **₦27,581** on average, suggesting moderate prediction accuracy.

In [48]:
import numpy as np

# Select only numeric columns for outlier detection
numeric_cols = df.select_dtypes(include=['number']).columns
df_numeric = df[numeric_cols]

Q1 = df_numeric.quantile(0.25)
Q3 = df_numeric.quantile(0.75)
IQR = Q3 - Q1

# Filter outliers based on numeric columns
df = df[~((df_numeric < (Q1 - 1.5 * IQR)) | (df_numeric > (Q3 + 1.5 * IQR))).any(axis=1)]

In [49]:
df.to_csv("cleaned_data.csv", index=False)

In [50]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import pandas as pd
from sklearn.preprocessing import StandardScaler # Import StandardScaler

# Assuming 'df' is already loaded and cleaned from previous steps

# Select features (X) and target (y)
X = df[['Marketing_Spend', 'Seasonality']]
y = df['Sales']

# Perform one-hot encoding on the 'Seasonality' column
X = pd.get_dummies(X, columns=['Seasonality'], drop_first=True)

# Standardizing numerical features - perform this *after* one-hot encoding
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)


# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2,
random_state=42)

# Initialize and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions and evaluate the model
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print(f"Model MSE: {mse}")

Model MSE: 760701209.9341159


In [51]:
rmse = np.sqrt(760701209.93)
print(f"Model RMSE: {rmse}")

Model RMSE: 27580.812350799242


## 4.Default Risk Classification Using Random Forest
* Features used: ‘Income’, ‘Loan_Amount’, ‘Credit_Score’
* Target: ‘Defaulted’ (binary classification)
* A RandomForestClassifier with 100 estimators was trained.

**Model Prediction Example:**
* Input: Income = ₦55,000, Loan Amount = ₦20,000, Credit Score = 650
* Output: Low Default Risk - Safe to approve loan

The model is effective at identifying patterns that predict financial risk, which is crucial for loan decision-making.

In [52]:
from sklearn.ensemble import RandomForestClassifier
X = df[['Income', 'Loan_Amount', 'Credit_Score']]
y = df['Defaulted']
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

In [53]:
# Predicting default risk for a new customer
new_customer = np.array([[55000, 20000, 650]]) # Income, LoanAmount, CreditScore
new_customer_scaled = scaler.transform(new_customer)
prediction = model.predict(new_customer_scaled)
if prediction[0] == 1:
  print("🚨 High Default Risk: Consider stricter loan approvalcriteria!")
else:
  print("✅ Low Default Risk: Safe to approve loan.")

✅ Low Default Risk: Safe to approve loan.




## 5. Key Insights
- Marketing_Spend is positively correlated with higher sales, confirming that promotional investments drive revenue.
- Seasonality plays a key role in sales volume, with some periods (e.g., holidays or dry season) yielding significantly more sales.
- Credit_Score and Loan_Amount are the strongest indicators of loan default risk.
- Median imputation was successful in maintaining model integrity, with no major drop in performance due to missing data.
- Customers with mid-range incomes and average credit scores typically fall into the low-risk segment for loan approvals.

## 6. Recommendations
**Sales Strategy**
- Increase marketing spend during high-performing seasonal windows to maximize ROI.
- Monitor **Seasonality** impact continuously and use time-aware forecasting for future planning.

**Loan Approval Policy**
- Use the trained classification model to screen applicants with high loan amounts and low credit scores.
- Proceed with approval for customers predicted to have low default risk, like the example case (₦55,000 income, 650 credit score).

## 7. Conclusion
The project combined data cleaning, statistical learning, and AI-based risk modeling to generate actionable business insights. The results confirm that marketing effectiveness and customer creditworthiness can be predicted and optimized with well-prepared data and machine learning. The sales forecasting and risk classification models provide a strong foundation for more advanced analytics and automation.