<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="300" alt="Skills Network Logo">
    </a>
</p>


# Test Environment for Generative AI classroom labs

This lab provides a test environment for the codes generated using the Generative AI classroom.

Follow the instructions below to set up this environment for further use.


# Setup


### Install required libraries

In case of a requirement of installing certain python libraries for use in your task, you may do so as shown below.


In [22]:
%pip install seaborn
import piplite

await piplite.install(['nbformat', 'plotly'])

### Dataset URL from the GenAI lab
Use the URL provided in the GenAI lab in the cell below. 


In [23]:
URL = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-Coursera/laptop_pricing_dataset_mod1.csv"

### Downloading the dataset

Execute the following code to download the dataset in to the interface.

> Please note that this step is essential in JupyterLite. If you are using a downloaded version of this notebook and running it on JupyterLabs, then you can skip this step and directly use the URL in pandas.read_csv() function to read the dataset as a dataframe


In [24]:
from pyodide.http import pyfetch

async def download(url, filename):
    response = await pyfetch(url)
    if response.status == 200:
        with open(filename, "wb") as f:
            f.write(await response.bytes())

path = URL

await download(path, "dataset.csv")
file_name  = "dataset.csv"

---


# Test Environment


In [25]:
#AI instructions for the type of code to generate

#“Generate plan basic python code without using definition or exceptions or if else statements. Response should be concise.”

In [26]:
# Keep appending the code generated to this cell, or add more cells below this to execute in parts
#Write a Python code that can perform the following tasks:
#Read the CSV file, located on a given file path, into a Pandas data frame, assuming that the first rows of the file are the headers for the data.

import pandas as pd

# Path to the CSV file (update this to your actual file path)
file_path = file_name

# Read the CSV into a DataFrame, treating the first row as the header
df = pd.read_csv(file_path, header=0)

df.head()

Unnamed: 0.1,Unnamed: 0,Manufacturer,Category,Screen,GPU,OS,CPU_core,Screen_Size_cm,CPU_frequency,RAM_GB,Storage_GB_SSD,Weight_kg,Price
0,0,Acer,4,IPS Panel,2,1,5,35.56,1.6,8,256,1.6,978
1,1,Dell,3,Full HD,1,1,3,39.624,2.0,4,256,2.2,634
2,2,Dell,3,Full HD,1,1,7,39.624,2.7,8,256,2.2,946
3,3,Dell,4,IPS Panel,2,1,5,33.782,1.6,8,128,1.22,1244
4,4,HP,4,Full HD,2,1,7,39.624,1.8,8,256,1.91,837


In [27]:
#Write a Python code that identifies the columns with missing values in a pandas data frame and gives missing value counts per column.

# Compute missing value counts per column
missing_counts = df.isnull().sum()

# Identify columns with at least one missing value
columns_with_missing = missing_counts[missing_counts > 0].index.tolist()

# Output results
print("Missing value counts per column:")
print(missing_counts)

print("Columns with missing values:")
print(columns_with_missing)


Missing value counts per column:
Unnamed: 0        0
Manufacturer      0
Category          0
Screen            0
GPU               0
OS                0
CPU_core          0
Screen_Size_cm    4
CPU_frequency     0
RAM_GB            0
Storage_GB_SSD    0
Weight_kg         5
Price             0
dtype: int64
Columns with missing values:
['Screen_Size_cm', 'Weight_kg']


In [28]:
#Write a Python code to replace the missing values in a pandas data frame, per the following guidelines.
#1. For a categorical attribute "Screen_Size_cm", replace the missing values with the most frequent value in the column.
#2. For a continuous value attribute "Weight_kg", replace the missing values with the mean value of the entries in the column.

# 1) Replace missing Screen_Size_cm values with the most frequent value (mode)
most_freq_screen = df['Screen_Size_cm'].mode().iloc[0]

# 2) Replace missing Weight_kg values with the mean of the column
mean_weight = df['Weight_kg'].mean()

# Apply replacements for both columns in a single operation
df.fillna({'Screen_Size_cm': most_freq_screen, 'Weight_kg': mean_weight}, inplace=True)

In [29]:
#Write a Python code snippet to change the data type of the attributes "Screen_Size_cm" and "Weight_kg" of a data frame to float.

# Convert the specified columns to float (coerce non-numeric values to NaN)
df['Screen_Size_cm'] = pd.to_numeric(df['Screen_Size_cm'], errors='coerce')
df['Weight_kg'] = pd.to_numeric(df['Weight_kg'], errors='coerce')

In [30]:
#Write a Python code to modify the contents under the following attributes of the data frame as required.
#1. Data under 'Screen_Size_cm' is assumed to be in centimeters. Convert this data into inches. Modify the name of the attribute to 'Screen_Size_inch'.
#2. Data under 'Weight_kg' is assumed to be in kilograms. Convert this data into pounds. Modify the name of the attribute to 'Weight_pounds'.

# 1) Convert Screen_Size_cm (cm) to Screen_Size_inch (inches) and drop the old column
df['Screen_Size_inch'] = df['Screen_Size_cm'] / 2.54
df.drop(columns=['Screen_Size_cm'], inplace=True)

# 2) Convert Weight_kg (kg) to Weight_pounds (lb) and drop the old column
df['Weight_pounds'] = df['Weight_kg'] * 2.2046226218
df.drop(columns=['Weight_kg'], inplace=True)

In [31]:
#Write a Python code to normalize the content under the attribute "CPU_frequency" in a data frame df concerning its maximum value. Make changes to the original data, and do not create a new attribute.

# Normalize the 'CPU_frequency' column in place by its maximum value
# Assumes 'df' is a pre-loaded DataFrame containing a numeric 'CPU_frequency' column
max_val = df['CPU_frequency'].max()
df['CPU_frequency'] = df['CPU_frequency'].div(max_val)

In [32]:
#Write a Python code to perform the following tasks.
#1. Convert a data frame df attribute "Screen", into indicator variables, saved as df1, with the naming convention "Screen_<unique value of the attribute>".
#2. Append df1 into the original data frame df.
#3. Drop the original attribute from the data frame df.

# One-hot encode the 'Screen' column into separate indicator columns
# The resulting df1 will have column names like 'Screen_<category>'
df1 = pd.get_dummies(df['Screen'], prefix='Screen')

# Append the new indicator columns to the original dataframe
df = pd.concat([df, df1], axis=1)

# Drop the original 'Screen' column
df.drop(columns=['Screen'], inplace=True)

In [33]:
#Convert the values under a df column named Price from USD to Euros

# Convert the 'Price' column from USD to EUR in place
# Update the exchange rate as needed
USD_TO_EUR = 0.92  # example rate: 1 USD = 0.92 EUR

# In-place conversion
df['Price'] = df['Price'] * USD_TO_EUR

In [34]:
#Write a Python code to normalize the content under the attribute "CPU_frequency" in a data frame df concerning its minimum and maximum value. Make changes to the original data, and do not create a new attribute.

import numpy as np

# In-place min-max normalization of the 'CPU_frequency' column
min_val = df['CPU_frequency'].min()
max_val = df['CPU_frequency'].max()
range_val = max_val - min_val
# Use a safe denominator: 1 when range is 0 to avoid division by zero; this retains 0s when all values are identical
df['CPU_frequency'] = (df['CPU_frequency'] - min_val) / np.where(range_val != 0, range_val, 1)

df.head()

Unnamed: 0.1,Unnamed: 0,Manufacturer,Category,GPU,OS,CPU_core,CPU_frequency,RAM_GB,Storage_GB_SSD,Price,Screen_Size_inch,Weight_pounds,Screen_Full HD,Screen_IPS Panel
0,0,Acer,4,2,1,5,0.235294,8,256,899.76,14.0,3.527396,False,True
1,1,Dell,3,1,1,3,0.470588,4,256,583.28,15.6,4.85017,True,False
2,2,Dell,3,1,1,7,0.882353,8,256,870.32,15.6,4.85017,True,False
3,3,Dell,4,2,1,5,0.235294,8,128,1144.48,13.3,2.68964,False,True
4,4,HP,4,2,1,7,0.352941,8,256,770.04,15.6,4.210829,True,False


## Lab on Using Generative AI for Data Insights ##

In [35]:
#Write a python code to perform the following actions.
#1. Import a data set from a CSV file, The headers for the data set must be in the first row of the CSV file.
#2. Generate the statistical description of all the features used in the data set. Include "object" data types as well.

import pandas as pd

# Path to the input CSV file. Assumes the first row contains headers.
csv_path = 'dataset.csv'

# Load the dataset; pandas automatically uses the first row as header by default.
df = pd.read_csv(csv_path)

# Generate a statistical description for all features, including object types.
# include='all' yields numeric statistics for numeric columns and frequency/unique counts for object types.
description = df.describe(include='all')

# Output the description to stdout. Converts non-numeric values into readable summary.
print(description)

        Unnamed: 0 Manufacturer    Category   Screen         GPU          OS  \
count   238.000000          238  238.000000      238  238.000000  238.000000   
unique         NaN           11         NaN        2         NaN         NaN   
top            NaN         Dell         NaN  Full HD         NaN         NaN   
freq           NaN           71         NaN      161         NaN         NaN   
mean    118.500000          NaN    3.205882      NaN    2.151261    1.058824   
std      68.848868          NaN    0.776533      NaN    0.638282    0.235790   
min       0.000000          NaN    1.000000      NaN    1.000000    1.000000   
25%      59.250000          NaN    3.000000      NaN    2.000000    1.000000   
50%     118.500000          NaN    3.000000      NaN    2.000000    1.000000   
75%     177.750000          NaN    4.000000      NaN    3.000000    1.000000   
max     237.000000          NaN    5.000000      NaN    3.000000    2.000000   

          CPU_core  Screen_Size_cm  CPU

In [36]:
#Write a Python code to perform the following actions.
#1. Create regression plots for the attributes "CPU_frequency", "Screen_Size_inch" and "Weight_pounds" against "Price".
#2. Create box plots for the attributes "Category", "GPU", "OS", "CPU_core", "RAM_GB" and "Storage_GB_SSD" against the attribute "Price".

import seaborn as sns
import matplotlib.pyplot as plt

# Use a clean style for plots
sns.set(style="whitegrid")

# 1) Regression plots: Price vs CPU_frequency, Screen_Size_inch, Weight_pounds
plt.figure(figsize=(18, 5))

plt.subplot(1, 3, 1)
sns.regplot(x="CPU_frequency", y="Price", data=df, scatter_kws={"alpha": 0.6}, line_kws={"color": "orange"})
plt.xlabel("CPU_frequency")
plt.ylabel("Price")
plt.title("Price vs CPU_frequency")

plt.subplot(1, 3, 2)
sns.regplot(x="Screen_Size_cm", y="Price", data=df, scatter_kws={"alpha": 0.6})
plt.xlabel("Screen_Size_cm")
plt.ylabel("Price")
plt.title("Price vs Screen_Size_cm")

plt.subplot(1, 3, 3)
sns.regplot(x="Weight_kg", y="Price", data=df, scatter_kws={"alpha": 0.6})
plt.xlabel("Weight_kg")
plt.ylabel("Price")
plt.title("Price vs Weight_kg")

plt.tight_layout()
plt.savefig("regression_plots.png")
plt.close()

# 2) Box plots: Price by Category, GPU, OS, CPU_core, RAM_GB, Storage_GB_SSD
plt.figure(figsize=(12, 18))
cols = ["Category", "GPU", "OS", "CPU_core", "RAM_GB", "Storage_GB_SSD"]
for i, col in enumerate(cols, 1):
    plt.subplot(3, 2, i)
    x_vals = df[col].astype(str)
    sns.boxplot(x=x_vals, y="Price", data=df)
    plt.xticks(rotation=45)
    plt.xlabel(col)
    plt.ylabel("Price")
    plt.title(f"Price by {col}")

plt.tight_layout()
plt.savefig("boxplots_price_by_attrs.png")
plt.close()

In [37]:
#Write a Python code for the following.
#1. Evaluate the correlation value, pearson coefficient and p-values for all numerical attributes against the target attribute "Price".
#2. Don't include the values evaluated for target variable against itself.
#3. Print these values as a part of a single dataframe against each individual attribute.

from scipy.stats import pearsonr

# Identify numeric columns and exclude the target 'Price'
numeric_cols = [col for col in df.select_dtypes(include=["number"]).columns if col != "Price"]

# Compute correlation, Pearson coefficient, and p-value for each numeric attribute against Price
results = []
for col in numeric_cols:
    subset = df[[col, "Price"]].dropna()
    r_value = subset[col].corr(subset["Price"])  # Pearson correlation coefficient
r_coef, p_val = pearsonr(subset[col], subset["Price"])  # Pearson r and p-value
results.append({
        "Attribute": col,
        "Correlation": r_value,
        "Pearson_coefficient": r_coef,
        "p_value": p_val
    })

# Present results as a single dataframe
results_df = pd.DataFrame(results)
print(results_df)

   Attribute  Correlation  Pearson_coefficient   p_value
0  Weight_kg    -0.050707            -0.050707  0.441094


In [38]:
#Write a python code that performs the following actions.
#1. Group the attributes "GPU", "CPU_core" and "Price", as available in a dataframe df
#2. Create a pivot table for this group, assuming the target variable to be 'Price' and aggregation function as mean
#3. Plot a pcolor plot for this pivot table.

# Pivot table: mean Price by GPU (rows) and CPU_core (columns)
pivot_table = df.pivot_table(index='GPU', columns='CPU_core', values='Price', aggfunc='mean')
# Replace missing values with 0 for visualization
pivot_filled = pivot_table.fillna(0)

plt.figure(figsize=(8, 6))
# Create a colored grid
plt.pcolormesh(pivot_filled.values, cmap='viridis', shading='auto')
plt.colorbar(label='Mean Price')

# Set axis labels and ticks
plt.xticks(np.arange(pivot_filled.shape[1]) + 0.5, [str(c) for c in pivot_filled.columns], rotation=45, ha='right')
plt.yticks(np.arange(pivot_filled.shape[0]) + 0.5, [str(r) for r in pivot_filled.index])
plt.xlabel('CPU_core')
plt.ylabel('GPU')
plt.title('Mean Price by GPU and CPU_core')
plt.tight_layout()
plt.savefig('price_pcolor_pivot.png')
plt.close()

## Lab on Generatie AI for model development ##

In [39]:
#Write a Python code that can perform the following tasks.
#Read the CSV file, located on a given file path, into a pandas data frame, assuming that the first row of the file can be used as the headers for the data.

import pandas as pd

# Path to the CSV file
file_path = "dataset.csv"  # Update this to the actual file location

# Read CSV into a DataFrame, using the first row as headers (default behavior)
df = pd.read_csv(file_path)

# df now holds the data with headers from the first row
# Optional: inspect basic information
print(f"DataFrame shape: {df.shape}")
print(df.head())

DataFrame shape: (238, 13)
   Unnamed: 0 Manufacturer  Category     Screen  GPU  OS  CPU_core  \
0           0         Acer         4  IPS Panel    2   1         5   
1           1         Dell         3    Full HD    1   1         3   
2           2         Dell         3    Full HD    1   1         7   
3           3         Dell         4  IPS Panel    2   1         5   
4           4           HP         4    Full HD    2   1         7   

   Screen_Size_cm  CPU_frequency  RAM_GB  Storage_GB_SSD  Weight_kg  Price  
0          35.560            1.6       8             256       1.60    978  
1          39.624            2.0       4             256       2.20    634  
2          39.624            2.7       8             256       2.20    946  
3          33.782            1.6       8             128       1.22   1244  
4          39.624            1.8       8             256       1.91    837  


In [40]:
#Write a Python code that performs the following tasks.
#1. Develops and trains a linear regression model that uses one attribute of a data frame as the source variable and another as a target variable.
#2. Calculate and display the MSE and R^2 values for the trained model.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Separate source (X) and target (y) as numpy arrays
X = df[['CPU_frequency']].to_numpy()  # 2D array required by scikit-learn
y = df['Price'].to_numpy()

# Initialize and train the linear regression model
model = LinearRegression()
model.fit(X, y)

# Predict on the training data
y_pred = model.predict(X)

# Compute evaluation metrics
mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)

# Display results
print(f'MSE: {mse:.6f}')
print(f'R^2: {r2:.6f}')

MSE: 284583.440587
R^2: 0.134444


In [41]:
#Write a Python code that performs the following tasks.
#1. Develops and trains a linear regression model that uses some attributes of a data frame as the source variables and one of the attributes as a target variable.
#2. Calculate and display the MSE and R^2 values for the trained model.

# Select source features and target
X = df[['CPU_frequency', 'RAM_GB', 'Storage_GB_SSD','CPU_core','OS', 'GPU', 'Category' ]].to_numpy()  # 2D array for scikit-learn
y = df['Price'].to_numpy()

# Initialize and train the linear regression model
model = LinearRegression()
model.fit(X, y)

# Predict on the training data
y_pred = model.predict(X)

# Compute evaluation metrics
mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)

# Display results
print(f'MSE: {mse:.6f}')
print(f'R^2: {r2:.6f}')

MSE: 161680.572639
R^2: 0.508251


In [42]:
#Write a Python code that performs the following tasks.
#1. Develops and trains multiple polynomial regression models, with orders 2, 3, and 5, that use one attribute of a data frame as the source variable and another as a target variable.
#2. Calculate and display the MSE and R^2 values for the trained models.
#3. Compare the performance of the models.

from sklearn.preprocessing import PolynomialFeatures

# Source feature matrix (n_samples x n_features)
X = df[['CPU_frequency']].to_numpy()
# Target vector
y = df['Price'].to_numpy()

# Polynomial degrees to evaluate
degrees = [2, 3, 5]
results = {}

for deg in degrees:
    # Transform input to polynomial features of the given degree
    poly = PolynomialFeatures(degree=deg, include_bias=False)
    X_poly = poly.fit_transform(X)
    
    # Train a simple linear regression on the polynomial features
    model = LinearRegression()
    model.fit(X_poly, y)

    # Predict on training data
    y_pred = model.predict(X_poly)

    # Evaluate metrics
    mse = mean_squared_error(y, y_pred)
    r2 = r2_score(y, y_pred)
    results[deg] = {"mse": float(mse), "r2": float(r2)}

    print(f'Degree {deg}: MSE={mse:.6f}, R^2={r2:.6f}')

# Simple comparisons without using conditional statements
best_mse_deg = min(degrees, key=lambda d: results[d]["mse"])
best_r2_deg = max(degrees, key=lambda d: results[d]["r2"])
print(f'Best (lowest MSE): degree {best_mse_deg} with MSE {results[best_mse_deg]["mse"]:.6f}')
print(f'Best (highest R^2): degree {best_r2_deg} with R^2 {results[best_r2_deg]["r2"]:.6f}')

Degree 2: MSE=249022.665968, R^2=0.242601
Degree 3: MSE=241024.863038, R^2=0.266926
Degree 5: MSE=229137.295481, R^2=0.303082
Best (lowest MSE): degree 5 with MSE 229137.295481
Best (highest R^2): degree 5 with R^2 0.303082


In [43]:
#Write a Python code that performs the following tasks.
#1. Create a pipeline that performs parameter scaling, Polynomial Feature generation, and Linear regression. Use the set of multiple features as before to create this pipeline.
#2. Calculate and display the MSE and R^2 values for the trained model.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Features and target
X = df[['CPU_frequency', 'RAM_GB', 'Storage_GB_SSD','CPU_core','OS', 'GPU', 'Category']].to_numpy()
y = df['Price'].to_numpy()

# Create a pipeline: scaling, polynomial features, and linear regression
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('model', LinearRegression())
])

# Train the model
pipeline.fit(X, y)

# Predict on training data
y_pred = pipeline.predict(X)

# Evaluate metrics
mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)

# Display results
print(f'MSE: {mse:.6f}')
print(f'R^2: {r2:.6f}')

MSE: 191186.325630
R^2: 0.418510


In [46]:
#Write a Python code that performs the following tasks.
#1. Use polynomial features for some of the attributes of a data frame.
#2. Perform Grid search on a ridge regression model for a set of values of hyperparameter alpha and polynomial features as input.
#3. Use cross-validation in the Grid search.
#4. Evaluate the resulting model's MSE and R^2 values.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_predict

# Assume dataset is loaded into a DataFrame named `df` in the environment
# Use last column as target and remaining numeric features as inputs
target_col = df.columns[-1]
X = df.drop(columns=[target_col]).select_dtypes(include=[np.number])
y = df[target_col]

# Define a pipeline with polynomial feature expansion and ridge regression
pipeline = Pipeline([
    ('poly', PolynomialFeatures()),
    ('ridge', Ridge())
])

# Grid search over polynomial degree, inclusion of bias, and ridge alpha with cross-validation
param_grid = {
    'poly__degree': [2, 3],
    'poly__include_bias': [False],
    'ridge__alpha': [0.1, 1.0, 10.0]
}

grid_search = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1
)

grid_search.fit(X, y)

# Build a fixed pipeline using the best hyperparameters for final evaluation with CV
best_params = grid_search.best_params_
best_poly_degree = best_params['poly__degree']
best_poly_include_bias = best_params['poly__include_bias']
best_alpha = best_params['ridge__alpha']

best_pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=best_poly_degree, include_bias=best_poly_include_bias)),
    ('ridge', Ridge(alpha=best_alpha))
])

# Obtain cross-validated predictions using the best pipeline and compute metrics
y_pred = cross_val_predict(best_pipeline, X, y, cv=5)

mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)

print("Best params:", grid_search.best_params_)
print("Cross-validated MSE:", mse)
print("Cross-validated R^2:", r2)

Best params: {'poly__degree': 2, 'poly__include_bias': False, 'ridge__alpha': 10.0}
Cross-validated MSE: 0.99814260739466
Cross-validated R^2: -0.05468241983863398


## Authors


[Abhishek Gagneja](https://www.linkedin.com/in/abhishek-gagneja-23051987/)


## Change Log


|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2023-12-10|0.1|Abhishek Gagneja|Initial Draft created|


Copyright © 2023 IBM Corporation. All rights reserved.
