# AI Portfolio - Day 1 Basics

This notebook is a refresher on fundamental AI/ML concepts (data handling with pandas, exploratory analysis, and a first scikit-learn model) and lays the foundation for the portfolio projects.

## Learning Objectives
- Review pandas and data manipulation basics
- Load and explore a simple dataset
- Train a basic machine learning model
- Understand the project structure and workflow

## Table of Contents
1. [Environment Setup](#environment-setup)
2. [Data Loading and Exploration](#data-loading-and-exploration)
3. [Basic Data Analysis](#basic-data-analysis)
4. [Simple ML Model](#simple-ml-model)
5. [Next Steps](#next-steps)


## Environment Setup

First, let's import the necessary libraries and set up our environment.


In [None]:
# Import essential libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import warnings
warnings.filterwarnings('ignore')

# Set up plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Set random seed for reproducibility
np.random.seed(42)

print("✅ Environment setup complete!")
print(f"📊 Pandas version: {pd.__version__}")
print(f"🔢 NumPy version: {np.__version__}")
print(f"📈 Matplotlib version: {plt.matplotlib.__version__}")
print(f"🎨 Seaborn version: {sns.__version__}")


## Data Loading and Exploration

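Let's create a synthetic classification dataset (1,000 samples, 10 features, 3 classes) with `make_classification` and wrap it in a DataFrame to practice our data manipulation skills.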


In [None]:
# Generate synthetic dataset
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=8,
    n_redundant=2,
    n_classes=3,
    random_state=42
)

# Create DataFrame
feature_names = [f'feature_{i}' for i in range(X.shape[1])]
df = pd.DataFrame(X, columns=feature_names)
df['target'] = y

print("📊 Dataset Overview:")
print(f"Shape: {df.shape}")
print(f"Features: {df.shape[1] - 1}")
print(f"Samples: {df.shape[0]}")
print(f"Classes: {df['target'].nunique()}")

# Display first few rows
print("\n🔍 First 5 rows:")
df.head()


## Basic Data Analysis

Let's explore our dataset with some basic statistics and visualizations.


In [None]:
# Basic statistics
print("📈 Dataset Statistics:")
print(df.describe())

print("\n🎯 Target Distribution:")
print(df['target'].value_counts().sort_index())

# Check for missing values
print("\n❓ Missing Values:")
print(df.isnull().sum().sum())
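
Global statistics can hide differences between classes. As a small extension that is not part of the original cells, the sketch below rebuilds the same synthetic dataset so it runs on its own, then uses `groupby` to compare feature means per class. The names `class_means` and `class_share` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification

# Rebuild the notebook's synthetic dataset so this cell is self-contained
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=8,
    n_redundant=2,
    n_classes=3,
    random_state=42,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
df = pd.DataFrame(X, columns=feature_names)
df["target"] = y

# Mean of every feature within each class (one row per class)
class_means = df.groupby("target")[feature_names].mean()

# Fraction of samples belonging to each class
class_share = df["target"].value_counts(normalize=True).sort_index()

print(class_means.round(2))
print(class_share.round(3))
```

Features whose means differ clearly between rows of `class_means` are plausible candidates for being informative, although a difference in means alone is only a rough hint.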


In [None]:
# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Target distribution (bars ordered by class label)
df['target'].value_counts().sort_index().plot(kind='bar', ax=axes[0, 0])
axes[0, 0].set_title('Target Distribution')
axes[0, 0].set_xlabel('Class')
axes[0, 0].set_ylabel('Count')

# Feature correlation heatmap
correlation_matrix = df[feature_names].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, ax=axes[0, 1])
axes[0, 1].set_title('Feature Correlation Matrix')

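# Feature distributions, overlaid on one axis. DataFrame.hist() with several
# columns and a single Axes would clear the whole figure, so use plot(kind='hist').
df[feature_names[:4]].plot(kind='hist', bins=20, alpha=0.5, ax=axes[1, 0])
axes[1, 0].set_title('Feature Distributions (First 4)')
axes[1, 0].set_xlabel('Value')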

# Box plot for feature_0 by target
df.boxplot(column='feature_0', by='target', ax=axes[1, 1])
fig.suptitle('')  # drop the automatic "Boxplot grouped by target" figure title
axes[1, 1].set_title('Feature 0 by Target Class')
axes[1, 1].set_xlabel('Target Class')

plt.tight_layout()
plt.show()


## Simple ML Model

Now let's train a basic machine learning model to get familiar with the scikit-learn workflow.


In [None]:
# Prepare data for modeling
X = df[feature_names]
y = df['target']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"📊 Training set size: {X_train.shape[0]}")
print(f"📊 Test set size: {X_test.shape[0]}")
print(f"📊 Feature count: {X_train.shape[1]}")

# Train Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions
y_pred = rf_model.predict(X_test)

print("\n✅ Model training complete!")
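
A single train/test split gives one accuracy estimate, and that estimate depends on which rows land in the test set. As a hedged complement to the cell above (not part of the original workflow), the sketch below runs 5-fold stratified cross-validation on the same synthetic data. The fold count and the `cv_scores` name are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Same synthetic dataset as in the notebook
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=8,
    n_redundant=2,
    n_classes=3,
    random_state=42,
)

# Stratified folds keep the class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)

# One accuracy value per fold; the model is refit from scratch each time
cv_scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print(f"Fold accuracies: {cv_scores.round(3)}")
print(f"Mean accuracy: {cv_scores.mean():.3f} (std {cv_scores.std():.3f})")
```

The spread across folds gives a rough sense of how much the single-split test score could vary.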


In [None]:
# Model evaluation
print("📊 Classification Report:")
print(classification_report(y_test, y_pred))

# Confusion Matrix
plt.figure(figsize=(8, 6))
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

# Feature importance
feature_importance = pd.DataFrame({
    'feature': feature_names,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(data=feature_importance, x='importance', y='feature')
plt.title('Feature Importance')
plt.xlabel('Importance')
plt.tight_layout()
plt.show()
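
The `feature_importances_` values plotted above are impurity-based and are computed from the training data. One common cross-check, shown here as a sketch rather than as the notebook's method, is permutation importance on the held-out test set: shuffle one feature at a time and measure how much the score drops. The `n_repeats=10` setting and the `perm_importance` name are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Recreate the notebook's data, split, and model so this cell runs on its own
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=8,
    n_redundant=2,
    n_classes=3,
    random_state=42,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Shuffle each feature 10 times on the test set and record the accuracy drop
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=42
)

perm_importance = pd.DataFrame({
    "feature": feature_names,
    "mean_drop": result.importances_mean,
    "std": result.importances_std,
}).sort_values("mean_drop", ascending=False)

print(perm_importance.round(3))
```

If the two rankings disagree strongly, that is worth investigating before drawing conclusions about which features matter.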


## Next Steps

Great! You've completed the basics notebook. Here's what we've covered:

### ✅ What We Learned
1. **Environment Setup**: Imported essential libraries and fixed a random seed for reproducibility
2. **Data Manipulation**: Created and explored a synthetic dataset using pandas
3. **Data Visualization**: Created various plots to understand data distributions and relationships
4. **Machine Learning**: Trained a Random Forest classifier and evaluated its performance
5. **Model Interpretation**: Analyzed feature importance and model predictions

### 🚀 Upcoming Projects
1. **P1 - RAG Project**: Start building your retrieval-augmented generation system
2. **P2 - Vision Project**: Work on computer vision and multimodal models
3. **P3 - MLOps Project**: Create production-ready ML services with proper CI/CD

### 📚 Key Takeaways
- Always set random seeds (`np.random.seed`, `random_state`) for reproducibility
- Explore your data thoroughly before modeling
- Use stratified train/test splits for classification so each split keeps the class proportions
- Evaluate models with multiple metrics
- Visualize results for better understanding
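
To make the "multiple metrics" takeaway concrete, here is a minimal sketch that scores the same Random Forest with three metrics side by side. It rebuilds the data and model so it is self-contained; the choice of metrics and the `metrics` dictionary are assumptions for illustration, not a prescribed set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Same synthetic dataset, stratified split, and model as in the notebook
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=8,
    n_redundant=2,
    n_classes=3,
    random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Accuracy can look fine on imbalanced data while minority classes suffer;
# balanced accuracy and macro F1 weight every class equally
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "balanced_accuracy": balanced_accuracy_score(y_test, y_pred),
    "macro_f1": f1_score(y_test, y_pred, average="macro"),
}

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

On this roughly balanced dataset the three numbers should be close to each other; a large gap between them on real data usually points to class imbalance or a class the model handles poorly.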

### 🔗 Resources
- [Pandas Documentation](https://pandas.pydata.org/docs/)
- [Scikit-learn User Guide](https://scikit-learn.org/stable/user_guide.html)
- [Matplotlib Gallery](https://matplotlib.org/stable/gallery/)
- [Seaborn Examples](https://seaborn.pydata.org/examples/)

Happy coding! 🎉
