# EDA

## Table of Contents

I. [Imports](#Imports)<br>
II. [Loading the Data](#LoadingtheData)<br>
III. [Data Cleaning](#DataCleaning) <br>
IV. [Univariate Analysis](#UnivariateAnalysis)<br>
V. [User Analysis](#UserAnalysis)<br>
VI. [Article Analysis](#ArticleAnalysis)<br>
VII. [Outlier Detection](#OutlierDetection)<br>
VIII. [Conclusion](#Conclusion)<br>


### <a class="anchor" id="Imports">Part I : Imports</a>

In [8]:
import os
os.chdir('/Users/abiibrahim/PycharmProjects/IBM-recommendation-engine')

In [8]:
# EDA Template

## 1. Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
#import src.project_tests as t
from src import project_tests as t
import pickle

%matplotlib inline
# Setting default style for plots
sns.set_palette('pastel')



### <a class="anchor" id="LoadingtheData">Part II : Loading the Data</a>

In [17]:
df = pd.read_csv('data/user-item-interactions.csv')
df_content = pd.read_csv('data/articles_community.csv')
del df['Unnamed: 0']
del df_content['Unnamed: 0']

# Show first few rows
df.head()


| | article_id | title | email |
|---|---|---|---|
| 0 | 1430.0 | using pixiedust for fast, flexible, and easier... | ef5f11f77ba020cd36e1105a00ab868bbdbf7fe7 |
| 1 | 1314.0 | healthcare python streaming application demo | 083cbdfa93c8444beaa4c5f5e0f5f9198e4f9e0b |
| 2 | 1429.0 | use deep learning for image classification | b96a4f2e92d8572034b1e9b28f9ac673765cd074 |
| 3 | 1338.0 | ml optimization using cognitive assistant | 06485706b34a5c9bf2a0ecdac41daf7e7654ceb7 |
| 4 | 1276.0 | deploy your python model as a restful api | f01220c46fc92c6e6b161b1849de11faacd7ccb2 |


In [18]:
# Basic info about the dataset
df.info()

# Summary statistics
df.describe(include='all')


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45993 entries, 0 to 45992
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   article_id  45993 non-null  float64
 1   title       45993 non-null  object 
 2   email       45976 non-null  object 
dtypes: float64(1), object(2)
memory usage: 1.1+ MB


| | article_id | title | email |
|---|---|---|---|
| count | 45993.0 | 45993 | 45976 |
| unique | | 714 | 5148 |
| top | | use deep learning for image classification | 2b6c0f514c2f2b04ad3c4583407dccd0810469ee |
| freq | | 937 | 364 |
| mean | 908.846477 | | |
| std | 486.647866 | | |
| min | 0.0 | | |
| 25% | 460.0 | | |
| 50% | 1151.0 | | |
| 75% | 1336.0 | | |


### <a class="anchor" id="DataCleaning">Part III : Data Cleaning</a>


In [21]:
# Checking for missing values
print(df.isnull().sum())

# Handling missing values (examples)
# df.fillna(df.mean(numeric_only=True), inplace=True)  # For numerical columns
# df['column_name'] = df['column_name'].fillna('Unknown')  # For categorical columns

# Checking for fully duplicated rows (this only counts them; nothing is dropped)
print(f"Total duplicate values: {df.duplicated().sum()}")

article_id     0
title          0
email         13
dtype: int64
Total duplicate values: 0
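
The output above shows 13 interactions with no email and no fully duplicated rows. Before deciding how to treat the missing emails, it can help to look at the affected rows directly. The following is a minimal sketch on a made-up toy frame; the data and the names `toy` and `missing_email` are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Toy stand-in for the interactions table (all values are made up).
toy = pd.DataFrame({
    "article_id": [1430.0, 1314.0, 1429.0, 1429.0],
    "title": ["a", "b", "c", "c"],
    "email": ["u1", None, "u2", None],
})

# Boolean mask that is True where the email is missing.
missing_email = toy["email"].isnull()

print(toy.isnull().sum())      # null count per column
print(toy[missing_email])      # the rows with no email
print(toy.duplicated().sum())  # number of fully identical rows
```

The same mask applied to `df` would list the 13 null-email interactions without modifying the frame.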


### <a class="anchor" id="UnivariateAnalysis">Part IV : Univariate Analysis</a>


In [None]:
# Distribution of numerical features
df.hist(figsize=(16, 12), bins=30, edgecolor='black')
plt.suptitle('Histograms of Numerical Features', size=20)
plt.show()

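# Distribution of categorical features
# title and email have 714 and 5148 distinct values, so a full count plot is unreadable;
# plot only the 20 most frequent values of each column instead.
for col in df.select_dtypes(include='object').columns:
    top_values = df[col].value_counts().head(20)
    top_values.sort_values().plot(kind='barh', figsize=(10, 6))
    plt.title(f'Top 20 values of {col}', size=15)
    plt.xlabel('count')
    plt.show()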


### <a class="anchor" id="UserAnalysis">Part V : User Analysis</a>


`1.` What is the distribution of how many articles a user interacts with in the dataset? Provide a visual and descriptive statistics that summarize the number of times each user interacts with an article.
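
One way to approach this question is to count interactions per user and then summarize those counts. The sketch below uses a made-up toy frame with the same `article_id` and `email` columns; the data and the names `interactions_per_user`, `summary`, and `distribution` are assumptions for illustration.

```python
import pandas as pd

# Toy interactions (made up): one row per user-article interaction.
toy = pd.DataFrame({
    "article_id": [10.0, 11.0, 10.0, 12.0, 10.0, 13.0],
    "email": ["u1", "u1", "u2", "u1", "u3", None],
})

# Interactions per user; groupby drops the null-email row by default.
interactions_per_user = toy.groupby("email")["article_id"].count()

# Descriptive statistics of the per-user counts (count, mean, quartiles, max).
summary = interactions_per_user.describe()

# Frequency table: how many users have exactly k interactions.
distribution = interactions_per_user.value_counts().sort_index()

print(summary)
print(distribution)
```

On the real per-user counts, calling `interactions_per_user.hist()` would give the visual; a log-scaled axis may be worth trying if a few users dominate.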

In [None]:
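# Fill in the median and maximum number of user_article interactions below

# Interactions per user; groupby skips rows whose email is null
user_interactions = df.groupby('email')['article_id'].count()

median_val = user_interactions.median()  # 50% of individuals interact with ____ number of articles or fewer.
max_views_by_user = user_interactions.max()  # The maximum number of user-article interactions by any 1 user is ______.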

### <a class="anchor" id="ArticleAnalysis">Part VI : Article Analysis</a>

`2.` Explore and remove duplicate articles from the **df_content** dataframe.

In [None]:
# Find and explore duplicate articles

In [None]:
# Remove any rows that have the same article_id - only keep the first
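
The two cells above describe finding duplicate articles and keeping only the first row per `article_id`. A minimal sketch of both steps on a made-up toy frame follows; the `name` column and the variables `toy_content`, `dupes`, and `deduped` are hypothetical, and only `article_id` is taken from the notebook.

```python
import pandas as pd

# Toy stand-in for df_content (made up); "name" is a hypothetical column.
toy_content = pd.DataFrame({
    "article_id": [0, 1, 1, 2, 2],
    "name": ["a", "b", "b (copy)", "c", "c"],
})

# keep=False marks every row that belongs to a duplicated article_id group,
# which makes it easy to compare the copies side by side.
dupes = toy_content[toy_content.duplicated(subset="article_id", keep=False)]

# Keep only the first row for each article_id.
deduped = toy_content.drop_duplicates(subset="article_id", keep="first")

print(dupes)
print(deduped)
```

Assigning the result back (for example `df_content = df_content.drop_duplicates(subset='article_id', keep='first')`) would apply the same step to the real frame.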

`3.` Use the cells below to find:

**a.** The number of unique articles that have an interaction with a user.  
**b.** The number of unique articles in the dataset (whether they have any interactions or not).<br>
**c.** The number of unique users in the dataset. (excluding null values) <br>
**d.** The number of user-article interactions in the dataset.

In [None]:
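# Assumes df_content has an article_id column, as the deduplication step above implies
unique_articles = df['article_id'].nunique()  # The number of unique articles that have at least one interaction
total_articles = df_content['article_id'].nunique()  # The number of unique articles on the IBM platform
unique_users = df['email'].nunique()  # The number of unique users (nunique ignores null values)
user_article_interactions = df.shape[0]  # The number of user-article interactions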

`4.` Use the cells below to find the most viewed **article_id**, as well as how often it was viewed. After talking to the company leaders, the `email_mapper` function was deemed a reasonable way to map users to ids. There were a small number of null values, and it was found that all of these null values likely belonged to a single user (which is how the function below stores them).

In [None]:
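# Views per article, sorted from most to least viewed
article_views = df['article_id'].value_counts()
most_viewed_article_id = str(article_views.idxmax())  # The most viewed article in the dataset as a string with one value following the decimal
max_views = article_views.max()  # The most viewed article in the dataset was viewed how many times?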

In [None]:
## No need to change the code here - this will be helpful for later parts of the notebook
# Run this cell to map the user email to a user_id column and remove the email column

def email_mapper():
    coded_dict = dict()
    cter = 1
    email_encoded = []
    
    for val in df['email']:
        if val not in coded_dict:
            coded_dict[val] = cter
            cter+=1
        
        email_encoded.append(coded_dict[val])
    return email_encoded

email_encoded = email_mapper()
del df['email']
df['user_id'] = email_encoded

# show header
df.head()
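
The mapper above assigns ids in order of first appearance and relies on dictionary lookup to recognize repeated values. Distinct NaN objects do not compare equal to each other, so the nulls share one id only when they resolve to the same dictionary key. The sketch below is an alternative, not the notebook's function: it makes the "all nulls are one user" rule explicit. The function name `encode_first_seen` and the sample values are assumptions for illustration.

```python
import math

_MISSING = object()  # private sentinel key shared by every missing value


def encode_first_seen(values):
    """Map each distinct value to an integer id in order of first appearance.

    All missing values (None or NaN) are treated as one user and share an id.
    """
    ids = {}
    encoded = []
    for val in values:
        is_missing = val is None or (isinstance(val, float) and math.isnan(val))
        key = _MISSING if is_missing else val
        if key not in ids:
            ids[key] = len(ids) + 1  # ids start at 1, like the mapper above
        encoded.append(ids[key])
    return encoded


emails = ["a@x", "b@x", float("nan"), "a@x", float("nan"), None]
print(encode_first_seen(emails))
```

Taking the values as an argument instead of reading the global `df` also makes the mapping easy to check on small inputs.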

In [None]:
## If you stored all your results in the variable names above, 
## you shouldn't need to change anything in this cell

sol_1_dict = {
    '`50% of individuals have _____ or fewer interactions.`': median_val,
    '`The total number of user-article interactions in the dataset is ______.`': user_article_interactions,
    '`The maximum number of user-article interactions by any 1 user is ______.`': max_views_by_user,
    '`The most viewed article in the dataset was viewed _____ times.`': max_views,
    '`The article_id of the most viewed article is ______.`': most_viewed_article_id,
    '`The number of unique articles that have at least 1 rating ______.`': unique_articles,
    '`The number of unique users in the dataset is ______`': unique_users,
    '`The number of unique articles on the IBM platform`': total_articles
}

# Test your dictionary against the solution
t.sol_1_test(sol_1_dict)

### <a class="anchor" id="OutlierDetection">Part VII : Outlier Detection</a>


In [None]:
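# Boxplot for outlier detection
# Work on numeric columns only; quantiles and comparisons are undefined for text columns
numeric_df = df.select_dtypes(include=np.number)
for col in numeric_df.columns:
    plt.figure(figsize=(10, 4))
    sns.boxplot(x=numeric_df[col], color='lightblue')
    plt.title(f'Boxplot of {col}', size=15)
    plt.show()

# Using Z-score or IQR to detect outliers
from scipy import stats

# Z-score: keep rows whose numeric values all lie within 3 standard deviations
z_scores = np.abs(stats.zscore(numeric_df))
df_clean_z = df[(z_scores < 3).all(axis=1)]

# Alternatively, using IQR
Q1 = numeric_df.quantile(0.25)
Q3 = numeric_df.quantile(0.75)
IQR = Q3 - Q1

# A row is an outlier if any numeric value falls outside the 1.5 * IQR fences
is_outlier = ((numeric_df < (Q1 - 1.5 * IQR)) | (numeric_df > (Q3 + 1.5 * IQR))).any(axis=1)
df_clean = df[~is_outlier]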
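
To make the IQR rule in the cell above concrete, here is a minimal sketch applied to a single made-up numeric series. The values and the names `counts`, `outliers`, and `kept` are assumptions for illustration.

```python
import pandas as pd

# Toy numeric series (made up), e.g. interactions per user.
counts = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 40], name="interactions")

# Quartiles and interquartile range.
q1 = counts.quantile(0.25)
q3 = counts.quantile(0.75)
iqr = q3 - q1

# Fences: values beyond 1.5 * IQR from the quartiles count as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = counts[(counts < lower) | (counts > upper)]
kept = counts[(counts >= lower) & (counts <= upper)]

print(outliers)
print(len(kept))
```

The rule is most meaningful on measured quantities such as per-user interaction counts; applied to identifier columns it only flags ids that are numerically far from the rest.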


### <a class="anchor" id="Conclusion">Part VIII : Conclusion</a>