<a href="https://colab.research.google.com/github/amrahmani/Marketing/blob/main/AIMarketing_Ch0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Read Kaggle Datasets in Google Colab**

**1. Get Your Kaggle API Token:**

Go to the Kaggle website (https://www.kaggle.com/) and log in to your account.
Open your account settings by clicking your profile picture in the top-right corner and selecting "Account."
Click the "Settings" button.
Scroll down to the "API" section. Click on the "Create New API Token" button.
This will download a file named kaggle.json to your computer. This file contains your Kaggle API credentials.
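
Before uploading the token, it can help to confirm that the downloaded file looks like a Kaggle credential file. The sketch below assumes the usual layout of kaggle.json, a JSON object with a `username` field and a `key` field. The helper name `check_kaggle_token` and the placeholder values in the demo are hypothetical, introduced only for illustration.

```python
import json
import tempfile
from pathlib import Path

# Fields expected in kaggle.json (assumed layout: {"username": ..., "key": ...})
REQUIRED_KEYS = {"username", "key"}


def check_kaggle_token(path):
    """Return the username if the token file has the expected fields, else raise ValueError."""
    data = json.loads(Path(path).read_text())
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"kaggle.json is missing fields: {sorted(missing)}")
    return data["username"]


# Demo on a throwaway file with placeholder credentials (not a real token)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "kaggle.json"
    demo_path.write_text(json.dumps({"username": "example-user", "key": "0" * 32}))
    demo_user = check_kaggle_token(demo_path)

print(demo_user)
```

In practice you would call `check_kaggle_token` on the path of the file you downloaded rather than on a temporary file.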

**2. Upload Your Kaggle API Token to Google Colab:**

Open your Google Colab notebook. In the left sidebar, click on the "Files" icon.
Click the "Upload" button.
Select the kaggle.json file that you downloaded and upload it to the Colab environment.

**3. Install the Kaggle API Client in Colab:**

In a code cell in your Colab notebook, run the following command to install the Kaggle API client:

In [1]:
!pip install -q kaggle

**4. Configure the Kaggle API Credentials:**

In a new code cell, run the following commands to create the .kaggle directory and move the kaggle.json file into it with the correct permissions:

In [2]:
# The Kaggle client looks for the token in ~/.kaggle (note the leading dot)
!mkdir -p ~/.kaggle
# mv fails with "No such file or directory" if the token was already moved by an earlier run
!mv kaggle.json ~/.kaggle/   # or !mv /content/kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json


In [37]:
!ls -al ~/.kaggle/kaggle.json

-rw------- 1 root root 73 Apr 27 06:55 /root/.kaggle/kaggle.json
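
The same three steps (create the directory, move the token, restrict its permissions) can also be done from Python. This is a minimal sketch, not the only way to do it. The function name `install_kaggle_token` is hypothetical, and the demo works inside a temporary directory that stands in for the home directory so it does not touch real credentials.

```python
import os
import shutil
import stat
import tempfile
from pathlib import Path


def install_kaggle_token(token_path, home_dir):
    """Move the token to <home_dir>/.kaggle/kaggle.json and make it owner read/write only."""
    config_dir = Path(home_dir) / ".kaggle"
    config_dir.mkdir(parents=True, exist_ok=True)  # equivalent of mkdir -p
    target = config_dir / "kaggle.json"
    shutil.move(str(token_path), str(target))      # equivalent of mv
    os.chmod(target, 0o600)                        # equivalent of chmod 600
    return target


# Demo: a temporary directory plays the role of the home directory
with tempfile.TemporaryDirectory() as tmp:
    uploaded = Path(tmp) / "kaggle.json"
    uploaded.write_text('{"username": "example-user", "key": "placeholder"}')
    fake_home = Path(tmp) / "home"
    installed = install_kaggle_token(uploaded, fake_home)
    demo_mode = stat.S_IMODE(installed.stat().st_mode)
    demo_relative = installed.relative_to(fake_home).as_posix()
    demo_source_gone = not uploaded.exists()

print(oct(demo_mode), demo_relative)
```

In Colab you would pass the uploaded file (for example `/content/kaggle.json`) and `Path.home()` instead of the temporary paths.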


**5. Download the Dataset from Kaggle:**

You need to know the name of the dataset as it appears on Kaggle. This usually follows the format username/dataset-name.
Go to the Kaggle dataset page you want to use. The dataset name is typically found below the dataset title (e.g., vijayuv/onlineretail).
In a new code cell in Colab, use the kaggle datasets download command followed by the dataset name and the -p flag to specify the directory where you want to download the files (e.g., ./data/).

In [28]:
!kaggle datasets download -d vijayuv/onlineretail -p ./data/

Dataset URL: https://www.kaggle.com/datasets/vijayuv/onlineretail
License(s): CC0-1.0
Downloading onlineretail.zip to ./data
 69% 5.00M/7.20M [00:00<00:00, 9.65MB/s]
100% 7.20M/7.20M [00:00<00:00, 9.37MB/s]


In [36]:
!ls -al ./data/

total 51896
drwxr-xr-x 2 root root     4096 Apr 27 07:06 .
drwxr-xr-x 1 root root     4096 Apr 27 07:04 ..
-rw-r--r-- 1 root root 45580638 Sep 21  2019 OnlineRetail.csv
-rw-r--r-- 1 root root  7548702 Sep 21  2019 onlineretail.zip
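
If you only have the dataset page URL, the `username/dataset-name` identifier can be read from its path. The sketch below assumes the URL shape shown in the download output above (`https://www.kaggle.com/datasets/<owner>/<name>`). The helper `dataset_slug` is a hypothetical name used for illustration.

```python
from urllib.parse import urlparse


def dataset_slug(url):
    """Extract 'owner/dataset-name' from a Kaggle dataset page URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Expected path shape: /datasets/<owner>/<name>
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"not a Kaggle dataset URL: {url}")
    return f"{parts[1]}/{parts[2]}"


slug = dataset_slug("https://www.kaggle.com/datasets/vijayuv/onlineretail")
print(slug)  # vijayuv/onlineretail
```

The resulting string is what the `-d` flag of `kaggle datasets download` expects.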


**6. Unzip the Dataset (if necessary):**

Most Kaggle datasets are downloaded as zip files. You'll need to unzip them to access the individual data files (like CSV files).
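
Extraction can also be done from Python instead of the shell. The sketch below uses the standard-library `zipfile` module. The helper `extract_archive` is a hypothetical name, and the demo builds a small throwaway archive so that it runs without the Kaggle download.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()


# Demo: create a tiny archive, then extract it into a "data" subdirectory
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "demo.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("example.csv", "a,b\n1,2\n")
    demo_names = extract_archive(demo_zip, Path(tmp) / "data")
    demo_text = (Path(tmp) / "data" / "example.csv").read_text()

print(demo_names)
```

For the dataset used here, the call would be `extract_archive('./data/onlineretail.zip', './data/')`.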

In [31]:
!unzip ./data/onlineretail.zip -d ./data/

Archive:  ./data/onlineretail.zip
  inflating: ./data/OnlineRetail.csv  

**7. Read the Data into Pandas:**

Once the dataset is unzipped, you can use the pandas library to read the data files (e.g., CSV files) into a DataFrame:

In [38]:
import pandas as pd
# Specify the encoding explicitly; 'latin1' and 'ISO-8859-1' are two names for the same encoding
df = pd.read_csv('./data/OnlineRetail.csv', encoding='latin1')
# Now you can work with the DataFrame 'df'
print(df.head())

  InvoiceNo StockCode                          Description  Quantity  \
0    536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
1    536365     71053                  WHITE METAL LANTERN         6   
2    536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
3    536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
4    536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   

      InvoiceDate  UnitPrice  CustomerID         Country  
0  12/1/2010 8:26       2.55     17850.0  United Kingdom  
1  12/1/2010 8:26       3.39     17850.0  United Kingdom  
2  12/1/2010 8:26       2.75     17850.0  United Kingdom  
3  12/1/2010 8:26       3.39     17850.0  United Kingdom  
4  12/1/2010 8:26       3.39     17850.0  United Kingdom  
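
The explicit `encoding` argument matters when a file contains bytes that are not valid UTF-8. The sketch below illustrates the general mechanism on a single string. It is an assumption, not something the output above confirms, that OnlineRetail.csv contains such bytes (for example a pound sign stored as a single Latin-1 byte).

```python
# A pound sign encoded as Latin-1 is the single byte 0xA3, which is not valid UTF-8 on its own
raw = "£2.55".encode("latin1")

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the matching encoding recovers the original text
decoded = raw.decode("latin1")
same_under_alias = raw.decode("ISO-8859-1") == decoded

print(utf8_ok, decoded, same_under_alias)
```

`pd.read_csv` decodes the file bytes in the same way, which is why a mismatched encoding raises a `UnicodeDecodeError` while `encoding='latin1'` reads the file.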


In [45]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv('./data/OnlineRetail.csv', encoding='ISO-8859-1')

print("Original DataFrame Shape:", df.shape)

# 1- Identify Missing Values
print("\n--- 1. Initial Missing Values ---\n")
print(df.isnull().sum())
initial_missing_count = df.isnull().sum().sum()
print(f"\nTotal initial missing values: {initial_missing_count}")

# 2- Impute Missing Values with Mean
print("\n--- 2. Imputing Missing Values (UnitPrice) ---\n")
print("Rows with missing UnitPrice before imputation:")
print(df[df['UnitPrice'].isnull()])

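mean_unit_price = df['UnitPrice'].mean()
# Assign the result back: fillna(..., inplace=True) on df['UnitPrice'] raises a pandas FutureWarning
df['UnitPrice'] = df['UnitPrice'].fillna(mean_unit_price)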

print("\nRows with missing UnitPrice after imputation:")
print(df[df['UnitPrice'].isnull()])
print("\nMissing values after UnitPrice imputation:")
print(df.isnull().sum())

# 3- Remove rows or columns with a high percentage of missing values (> 50%)
print("\n--- 3. Removing Rows/Columns with High Missing Percentage (> 50%) ---\n")

# Check column-wise missing percentage
column_missing_percentage = (df.isnull().sum() / len(df)) * 100
columns_to_drop = column_missing_percentage[column_missing_percentage > 50].index
print("Columns with > 50% missing values:", columns_to_drop)

if not columns_to_drop.empty:
    print("\nDataFrame info before dropping columns:")
    df.info()
    df.drop(columns=columns_to_drop, inplace=True)
    print("\nDataFrame info after dropping columns:")
    df.info()
else:
    print("\nNo columns found with more than 50% missing values.")

# Check row-wise missing percentage
row_missing_percentage = (df.isnull().sum(axis=1) / df.shape[1]) * 100
rows_to_drop = row_missing_percentage[row_missing_percentage > 50].index
print("\nNumber of rows with > 50% missing values:", len(rows_to_drop))

if not rows_to_drop.empty:
    print("\nFirst 5 rows with > 50% missing values before dropping:")
    print(df.loc[rows_to_drop.head()])
    df.drop(index=rows_to_drop, inplace=True)
    print("\nDataFrame shape after dropping rows:", df.shape)
    print("\nFirst 5 rows with > 50% missing values after dropping:")
    # Check if any still exist (should be none)
    remaining_high_missing_rows = df[(df.isnull().sum(axis=1) / df.shape[1]) * 100 > 50]
    if not remaining_high_missing_rows.empty:
        print(remaining_high_missing_rows.head())
    else:
        print("No rows with > 50% missing values remain.")
else:
    print("\nNo rows found with more than 50% missing values.")

print("\nMissing values after handling high percentage missing rows/columns:")
print(df.isnull().sum())

Original DataFrame Shape: (541909, 8)

--- 1. Initial Missing Values ---

InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64

Total initial missing values: 136534

--- 2. Imputing Missing Values (UnitPrice) ---

Rows with missing UnitPrice before imputation:
Empty DataFrame
Columns: [InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, Country]
Index: []

Rows with missing UnitPrice after imputation:
Empty DataFrame
Columns: [InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, Country]
Index: []

Missing values after UnitPrice imputation:
InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64

--- 3. Removing Rows/Columns with High Missing Percentage (> 50%) ---

Columns with > 50% missing values: Index([], dtype='object')

No columns found with more than 50% missing values.

Number of rows with > 50% missing values: 0

No rows found with more than 50% missing values.

Missing values after handling high percentage missing rows/columns:
InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64
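
In the run above nothing is imputed or dropped: UnitPrice has no missing values, and the column with the most gaps, CustomerID, is missing in 135080 of 541909 rows (about 24.9%), below the 50% threshold. To see each branch take effect, the sketch below applies the same three steps to a small made-up DataFrame. The data and the name `toy` are hypothetical and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up data: row 1 is mostly empty, and CustomerID is missing in 3 of 4 rows
toy = pd.DataFrame({
    "UnitPrice": [2.0, np.nan, 4.0, 6.0],
    "Description": ["LANTERN", None, "HOLDER", "HANGER"],
    "Country": ["United Kingdom", None, "France", "Germany"],
    "CustomerID": [np.nan, np.nan, np.nan, 17850.0],
})

# Mean imputation by assignment (avoids the inplace-on-a-column pattern)
mean_price = toy["UnitPrice"].mean()  # NaN is skipped, so this is the mean of 2, 4, 6
toy["UnitPrice"] = toy["UnitPrice"].fillna(mean_price)

# Drop columns with more than 50% missing values
col_missing_pct = toy.isnull().sum() / len(toy) * 100
cols_to_drop = col_missing_pct[col_missing_pct > 50].index.tolist()
toy = toy.drop(columns=cols_to_drop)

# Drop rows with more than 50% missing values among the remaining columns
row_missing_pct = toy.isnull().sum(axis=1) / toy.shape[1] * 100
toy = toy[row_missing_pct <= 50]

print(cols_to_drop)
print(toy)
```

Here CustomerID is dropped at 75% missing. Row 1 is then dropped because two of its three remaining values are missing, even though its UnitPrice was imputed.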
