# 🧠 Day 1: Introduction to Machine Learning
# Welcome to the ML Workshop!




By the end of this notebook, you will be able to:
- Understand Python basics used in ML
- Load and visualize real-world data
- Fit a simple linear regression model using statsmodels and sklearn

In [None]:
# Import essential libraries. These are pre-installed in Colab, as are seaborn, sklearn, tensorflow, pytorch, etc.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# List all libraries pre-installed in Colab
!pip list

In [None]:
print(pd.__version__)
print(np.__version__)
import matplotlib
print(matplotlib.__version__)

# ===============================================
# 📌 Section 2: Python Basics for Machine Learning
# ===============================================

### Python Basics: Variables
This is where we define and use simple numeric variables.

In [None]:
#Variables and basic math
a = 5
b = 2
print("a + b =", a + b)
print("a ** b =", a ** b)


In [None]:
type(a)

In [None]:
a + b

In [None]:
c = 3.14
d = 2.54

print(c/d)

In [None]:
type(c)

In [None]:
name = "Erdener"
middle_name = "Emin"

print(name + middle_name)

In [None]:
type(name)

In [None]:
# Lists
fruits = ["apple", "banana", "cherry"]
drinks = ["water", "soda", "beer"]

print("First fruit in list:", fruits[0])
print("Last drink in list:", drinks[-1])

In [None]:
fruits + drinks

In [None]:
# For loop
for meyve in fruits:
    print("I like", meyve)

In [None]:
# Conditional logic
x = 10
if x > 5:
    print("x is greater than 5")

In [None]:
# Functions
def take_square(x):
    return x ** 2

print("square of 3 is:", take_square(3))

In [None]:
take_square(3)

In [None]:
# Use an f-string to combine text with a function's output
print(f"square of {3} is: {take_square(3)}")


# ===========================================================
# 📌 Section 3: NumPy, Pandas, and Matplotlib Basics for ML
# ===========================================================

In [None]:
array = np.array([1, 2, 3, 4, 5])
array_2 = np.array([6, 7, 8, 9, 10])
print("Array:", array)
print("Mean:", np.mean(array))
print("Standard Deviation:", np.std(array))
print(array + 10)

In [None]:
array + array_2

In [None]:
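# Adding two Python lists concatenates them (compare with the element-wise array addition above)
# Named list_1 rather than list to avoid shadowing Python's built-in list type
list_1 = [1, 2, 3, 4]
list_2 = [5, 6, 7, 8]
list_1 + list_2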


In [None]:
#DataFrame
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "Score": [85.5, 92.0, 88.0]
}

df = pd.DataFrame(data)
df


In [None]:
print(df)

In [None]:
print(df["Score"].mean())

In [None]:
# Plot a simple line chart
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

plt.plot(x, y)  # plot the y values against x
plt.title("Square Function")
plt.xlabel("x")
plt.ylabel("x squared")
plt.show()

# ============================================
# 📌 Section 4: Uploading and Exploring Data
# ============================================


In [None]:
# Import Colab's helper for uploading files from your computer
from google.colab import files

In [None]:
#Open a file upload dialog to select and upload files from your local machine
uploaded = files.upload()

In [None]:
df = pd.read_csv("netflix_data.csv")

In [None]:
#First 5 rows of the dataset
df.head()

In [None]:
#Data Summary
df.describe()

In [None]:
# Column names, data types, and non-null counts
df.info()

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# 📦 Install the ucimlrepo package to access datasets from the UCI Machine Learning Repository
!pip install ucimlrepo


In [None]:
# 📥 Fetch the Wine Quality dataset (UCI ID 186) directly from the UCI ML Repository
from ucimlrepo import fetch_ucirepo
wq = fetch_ucirepo(id=186)

In [None]:
url_2 = "https://archive.ics.uci.edu/static/public/186/data.csv"
df_2 = pd.read_csv(url_2)

In [None]:
df_2

In [None]:
df_2.columns

In [None]:
student_performance = fetch_ucirepo(id=320)
student_performance

In [None]:
url_student = "https://archive.ics.uci.edu/static/public/320/data.csv"
student = pd.read_csv(url_student)
student

# ========================================================
# 📌 Section 5: Simple Linear Regression
# ========================================================

### Load the dataset and preview
We’ll use the `tips.csv` dataset to explore the relationship between total bill and tip amount.


In [None]:
#CSV raw URL from GitHub
url = "https://raw.githubusercontent.com/plotly/datasets/master/tips.csv"

# pandas can read a CSV file directly from a URL
df = pd.read_csv(url)

df.head()

In [None]:
df

In [None]:
plt.figure(figsize=(8, 5))
plt.scatter(df["total_bill"], df["tip"], alpha=0.7)
plt.title("Scatter Plot: Total Bill vs Tip")
plt.xlabel("Total Bill ($)")
plt.ylabel("Tip ($)")
plt.grid(True)
plt.show()

In [None]:
import statsmodels.api as sm

# Define features (X) and target (y)
X = df[["total_bill"]]  # independent variable
y = df["tip"]           # dependent variable

# Add constant (β₀) term for intercept
X_with_const = sm.add_constant(X)

# Fit OLS model
model = sm.OLS(y, X_with_const).fit()

# Show summary
print(model.summary())
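
For a single predictor, the intercept and slope that OLS reports also have a simple closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept is whatever makes the line pass through the two means. The sketch below checks this on a small made-up dataset. The arrays `toy_bill` and `toy_tip` are hypothetical values chosen for illustration, not the tips data, and `np.polyfit` is used only as an independent cross-check.

```python
import numpy as np

# Hypothetical toy data in dollars (NOT the tips dataset)
toy_bill = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
toy_tip = np.array([2.0, 3.0, 5.0, 6.0, 9.0])

# Deviations of each value from its mean
bill_dev = toy_bill - toy_bill.mean()
tip_dev = toy_tip - toy_tip.mean()

# Slope = sum of cross-deviations / sum of squared x-deviations
slope = np.sum(bill_dev * tip_dev) / np.sum(bill_dev ** 2)

# Intercept makes the fitted line pass through (mean of x, mean of y)
intercept = toy_tip.mean() - slope * toy_bill.mean()

print("Intercept:", round(intercept, 4))
print("Slope:", round(slope, 4))

# Cross-check with NumPy's least-squares polynomial fit (degree 1)
check_slope, check_intercept = np.polyfit(toy_bill, toy_tip, 1)
print("polyfit agrees:", np.isclose(slope, check_slope), np.isclose(intercept, check_intercept))
```

Running the same arithmetic on `df["total_bill"]` and `df["tip"]` should give the `total_bill` and `const` coefficients shown in the summary table, up to rounding.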

In [None]:
# Predict fitted values
y_pred = model.predict(X_with_const)


# Plot actual vs fitted
plt.figure(figsize=(8, 5))
plt.scatter(X, y, alpha=0.7, label="Actual Tips")
plt.plot(X, y_pred, color='red', linewidth=2, label="OLS Regression Line")

plt.title("OLS Regression: Tip vs Total Bill")
plt.xlabel("Total Bill ($)")
plt.ylabel("Tip ($)")
plt.legend()
plt.grid(True)
plt.show()


In [None]:
# Select predictor and target
X = df[["total_bill"]]  # independent variable
y = df["tip"]           # dependent variable

# Split into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape
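
To build intuition for `test_size=0.2` and `random_state=42`, here is a minimal sketch of a shuffled 80/20 split done by hand with NumPy. It is an illustration of the idea, not sklearn's actual implementation: the dataset size `n_rows` is hypothetical, rounding the test count up is an assumption of this sketch, and the rows selected here will not match the ones `train_test_split` picks.

```python
import numpy as np

n_rows = 10                                # hypothetical dataset size
rng = np.random.default_rng(42)            # fixed seed, so the shuffle is repeatable
shuffled = rng.permutation(n_rows)         # row indices 0..9 in random order

n_test = int(np.ceil(0.2 * n_rows))        # 20% of the rows go to the test set
test_idx = shuffled[:n_test]               # first chunk of the shuffle -> test rows
train_idx = shuffled[n_test:]              # everything else -> training rows

print("Train rows:", len(train_idx))
print("Test rows:", len(test_idx))
print("Overlap between the two sets:", set(train_idx) & set(test_idx))
```

The key properties are that every row lands in exactly one of the two sets, and that reusing the same seed reproduces the same split.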

In [None]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)



In [None]:
#If you want to see coefficients
print("Intercept:", model.intercept_)
print("Coefficient:", model.coef_[0])

In [None]:
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # take the root of the unrounded MSE so rounding error does not compound
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print results, rounded to 3 decimals for display only
print("Mean Squared Error (MSE):", round(mse, 3))
print("Root Mean Squared Error (RMSE):", round(rmse, 3))
print("Mean Absolute Error (MAE):", round(mae, 3))
print("R² Score:", round(r2, 3))

In [None]:
df.describe()

### 🧾 Interpreting Regression Error Metrics

We trained a simple linear regression model to predict the **tip** amount based on the **total bill**.

#### 📊 Dataset Summary (for reference):

- **Mean tip**: \$3.00
- **Standard deviation** of tip: \$1.38
- **Tip range**: \$1.00 to \$10.00

#### ✅ Model Evaluation:

- **Mean Squared Error (MSE): 0.569**
  - This is the average of the squared errors.
  - Because it is in squared units (dollars²), it is hard to interpret directly.
  - Lower is better, but MSE is more useful for comparing models than for judging a single model on its own.

- **Root Mean Squared Error (RMSE): 0.754**
  - This is the square root of MSE and is in the same units as the target (dollars).
  - Interpretation: The model’s predictions are typically **off by about \$0.75**.
  - This is well below the \$1.38 standard deviation of tips, which is roughly the error you would get by always predicting the mean tip, so the model is a reasonably good fit.

- **Mean Absolute Error (MAE): 0.621**
  - This is the average absolute difference between predicted and actual tips.
  - Interpretation: On average, the model is off by **about \$0.62**.
  - This is easier to interpret than RMSE and is less affected by outliers.

- **R² Score: 0.545**
  - This means the model explains **about 54.5% of the variation** in tip amounts.
  - The remaining 45.5% is unexplained (due to other factors like service quality, day, etc.).
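
The four metrics above can also be computed by hand, which makes their definitions concrete. The sketch below uses NumPy on a tiny made-up example. The arrays `y_true` and `y_hat` are hypothetical values, not the notebook's predictions, and the `_manual` suffix is only there to avoid overwriting the variables from the earlier cell.

```python
import numpy as np

# Hypothetical actual and predicted tips in dollars (NOT the notebook's results)
y_true = np.array([2.0, 3.0, 4.0, 5.0])
y_hat = np.array([2.0, 3.0, 4.0, 3.0])     # three perfect predictions, one miss of $2

errors = y_true - y_hat

mse_manual = np.mean(errors ** 2)          # average squared error (dollars²)
rmse_manual = np.sqrt(mse_manual)          # back in dollars
mae_manual = np.mean(np.abs(errors))       # average absolute error (dollars)

ss_res = np.sum(errors ** 2)                       # variation the model fails to explain
ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total variation around the mean
r2_manual = 1 - ss_res / ss_tot                    # share of variation explained

print("MSE:", mse_manual)
print("RMSE:", rmse_manual)
print("MAE:", mae_manual)
print("R²:", round(r2_manual, 3))
```

Here the single large miss pulls RMSE (1.0) well above MAE (0.5), which is what "less affected by outliers" means in practice.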



