<a href="https://colab.research.google.com/github/awsdevguru/PearsonMLFoundations/blob/main/1_2_02_Python_Tools_Overview.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab: Python Tools for Data Science & ML  
**Goal:** Familiarize yourself with the core Python tools used in machine learning for data handling, visualization, and basic modeling.  

## 1. Setup
**Objective:** Get comfortable importing and using the core data science stack.  

These are the essential Python tools you’ll use in almost every ML project: pandas for data, NumPy for math, matplotlib/seaborn for visualization, and scikit-learn for modeling.

In [None]:
!pip install pandas numpy matplotlib seaborn scikit-learn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

## 2. Load and Explore a Dataset

**Objective:** Practice using pandas for data loading and inspection.

A pandas DataFrame is a labeled, two-dimensional table, and it is the starting point of most ML workflows.
Use .shape, .columns, and .isna().sum() to check its size, column names, and missing values per column.
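
As a minimal sketch of those three inspection calls, the example below builds a small hand-made DataFrame rather than loading a real dataset. The `sample` frame and its values are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a real dataset, with one missing tip value
sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68],
    "tip": [1.01, 1.66, np.nan, 3.31],
    "day": ["Sun", "Sun", "Sat", "Sat"],
})

n_rows, n_cols = sample.shape             # (number of rows, number of columns)
column_names = list(sample.columns)       # column labels, in order
missing_per_column = sample.isna().sum()  # count of missing values per column

print(n_rows, n_cols)
print(column_names)
print(missing_per_column)
```

The same three calls work unchanged on the `df` loaded in the next cell.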

In [None]:
df = sns.load_dataset("tips")

# To see what other seaborn datasets are available:
# print("Seaborn datasets:\n- {}".format("\n- ".join(sns.get_dataset_names())))


print("\nDF:")
df


In [None]:
print("HEAD:")
df.head()

In [None]:
print("\nINFO:")
df.info()

In [None]:
print("\nDESCRIBE:")
df.describe()

## 3. Basic Data Manipulation (pandas + NumPy)
**Objective:** Demonstrate filtering, aggregations, and transformations.

In [None]:
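# Derived features: tip as a fraction of the bill, and the log of the bill
df['tip_pct'] = df['tip'] / df['total_bill']
df['log_total'] = np.log(df['total_bill'])
# Only a cell's last expression is displayed, so print the per-day mean explicitly
print(df.groupby('day', observed=True)['tip_pct'].mean())
df.describe()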
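
The cell above covers aggregation and transformation. Filtering, the third item in the objective, might look like the sketch below. It uses a small hypothetical `sample` frame shaped like the tips data, and the 15% threshold is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical rows shaped like the tips data
sample = pd.DataFrame({
    "total_bill": [10.0, 20.0, 40.0, 50.0],
    "tip": [1.0, 4.0, 4.0, 10.0],
    "day": ["Sat", "Sun", "Sun", "Sat"],
})
sample["tip_pct"] = sample["tip"] / sample["total_bill"]

# Boolean mask: keep rows where the tip is at least 15% of the bill
generous = sample[sample["tip_pct"] >= 0.15]

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
generous_sunday = sample[(sample["tip_pct"] >= 0.15) & (sample["day"] == "Sun")]

# .query() expresses the same filter as a string
same_rows = sample.query("tip_pct >= 0.15 and day == 'Sun'")

print(generous)
print(generous_sunday)
```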

## 4. Visualization with matplotlib & seaborn

**Objective:** Visualize distributions and relationships.

Visualizations help identify patterns and potential model features.

In [None]:
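sns.histplot(df['total_bill'], kde=True)
plt.title('Distribution of Total Bill')
plt.show()  # finish this figure so the scatter plot gets its own axes

sns.scatterplot(x='total_bill', y='tip', hue='smoker', data=df)
plt.title('Bill vs Tip by Smoking Status')
plt.show()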

## 5. Simple Modeling with scikit-learn

**Objective:** Show how to train and evaluate a simple regression model.

This illustrates the scikit-learn workflow: fit, predict, evaluate.

In [None]:
X = df[['total_bill']]
y = df['tip']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
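
After fitting, a linear model can be read off directly from its `coef_` and `intercept_` attributes. The sketch below assumes hypothetical noise-free data generated from a known line (slope 0.15, intercept 1.0), so the recovered values can be checked against it. `demo_model`, `bills`, and `tips` are illustrative names, not part of the lab.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data following tip = 0.15 * total_bill + 1.0
bills = pd.DataFrame({"total_bill": [10.0, 20.0, 30.0, 40.0]})
tips = 0.15 * bills["total_bill"] + 1.0

demo_model = LinearRegression().fit(bills, tips)

slope = demo_model.coef_[0]        # change in predicted tip per extra unit of bill
intercept = demo_model.intercept_  # predicted tip when the bill is 0

print(f"tip ~= {slope:.2f} * total_bill + {intercept:.2f}")
```

Applying the same two attributes to `model` above gives the line fitted to the real training data.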

In [None]:
# Compare actual and predicted tips on the test set
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.scatter(X_test, y_pred, color='red', label='Predicted')
plt.legend()
plt.xlabel('Total Bill')
plt.ylabel('Tip')
plt.title('Linear Regression Fit')
plt.show()

### Single Predictions

In [None]:
# Predict the tip for a few individual bill amounts
total_bills = [6, 22, 51, 33, 124]

for bill in total_bills:
    # Wrap the value in a one-row DataFrame so the column name matches the training data
    one = pd.DataFrame([{"total_bill": bill}])
    predicted_tip = model.predict(one)[0]  # predict once and reuse the result
    print(f"Total bill: {bill}\tPredicted tip: {predicted_tip:.2f}\t"
          f"Percentage: {predicted_tip / bill * 100:.2f}")

## 6. Evaluate the Model

**Objective:** Introduce basic evaluation metrics.

scikit-learn estimators share a common interface: .fit(), .predict(), and .score(). For regressors, .score() returns R^2.
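
To illustrate the shared interface, the sketch below checks that a regressor's `.score()` agrees with `r2_score` applied to its predictions. The data is hypothetical, generated from a noisy line with a fixed seed, and `demo_model`, `X_demo`, and `y_demo` are illustrative names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical noisy linear data, generated with a fixed seed
rng = np.random.default_rng(0)
X_demo = rng.uniform(5, 50, size=(100, 1))
y_demo = 0.15 * X_demo[:, 0] + 1.0 + rng.normal(0, 0.5, size=100)

demo_model = LinearRegression().fit(X_demo, y_demo)

# Two routes to the same number: the estimator's own score, and the metric function
score_via_method = demo_model.score(X_demo, y_demo)
score_via_metric = r2_score(y_demo, demo_model.predict(X_demo))

print(score_via_method, score_via_metric)
```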

In [None]:
from sklearn.metrics import r2_score, mean_absolute_error

print("R^2:", r2_score(y_test, y_pred))
print("MAE:", mean_absolute_error(y_test, y_pred))

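**R-squared** (R^2, the coefficient of determination) measures the proportion of the variance in the target that the regression model explains. A score of 1 means the predictions match the data exactly, 0 means the model does no better than always predicting the mean of the target, and the score can be negative when the model does worse than that.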

**Mean Absolute Error** (MAE) is the average of the absolute differences between predicted values and the true values: it tells you, on average, how far your predictions are from the targets. It’s measured in the same units as the target, ranges from 0 upward (lower is better), and is more robust to outliers than squared-error metrics because it doesn’t square the residuals.
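
Both metrics are short enough to compute by hand. The sketch below does so with NumPy on four hypothetical true/predicted pairs and compares the results with scikit-learn's functions. The arrays `y_true` and `y_hat` are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical true and predicted tips
y_true = np.array([2.0, 3.0, 4.0, 5.0])
y_hat = np.array([2.5, 3.0, 3.5, 6.0])

residuals = y_true - y_hat

# MAE: mean of the absolute residuals
mae = np.mean(np.abs(residuals))

# R^2: 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MAE:", mae, "sklearn:", mean_absolute_error(y_true, y_hat))
print("R^2:", r2, "sklearn:", r2_score(y_true, y_hat))
```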

## 7. Summary Discussion

Key tool takeaways:

| Tool               | Role in ML Workflow         |
| ------------------ | --------------------------- |
| pandas             | Data loading & cleaning     |
| NumPy              | Numerical operations        |
| matplotlib/seaborn | Visualization & exploration |
| scikit-learn       | Modeling & evaluation       |
