# Exploratory Data Analysis (EDA) for Adobe Customer Segmentation

### Introduction

In this notebook, we will perform **Exploratory Data Analysis (EDA)** on the dataset for Adobe customer segmentation. The goal of this analysis is to uncover patterns, trends, and insights that will help us understand how different customer segments interact with Adobe products and services. By doing so, we aim to build a strong foundation for **marketing strategies**, **product recommendations**, and **customer engagement**.

### Why EDA?

Exploratory Data Analysis (EDA) is a crucial first step in any data analysis pipeline. It allows us to:
- **Understand the data structure**: Identifying the types of data, such as numerical, categorical, or mixed.
- **Check for missing values and outliers**: Ensuring the data is clean and that there are no issues that could skew the results.
- **Discover patterns and trends**: Visualizing relationships between features and identifying which variables are most relevant to segmentation.
- **Form hypotheses**: Gaining insights to generate hypotheses that we can test in later analyses or modeling phases.

In the context of Adobe, the purpose of EDA is to uncover patterns that can guide us in creating **targeted marketing campaigns** based on customer behavior, subscription plans, and product usage. Additionally, EDA will help us understand how features like **usage frequency** and **document types** correlate with customer profiles.

---

### Key Questions We Want to Answer

Through this EDA, we aim to address the following key questions:
1. **Customer Behavior Patterns**:
    - How do different customer segments engage with various Adobe products (e.g., Photoshop, Illustrator, After Effects)?
    - How is usage distributed across weekdays and weekends, and across usage types (e.g., mobile vs. desktop)?
   
2. **Product Usage Trends**:
    - What products do customers use the most?
    - Are there correlations between product usage and subscription types (e.g., premium vs. standard plans)?

3. **Segmentation Insights**:
    - Can we identify distinct groups based on customer activity (e.g., heavy users vs. light users)?
    - What are the key features that distinguish different customer clusters?

4. **Marketing and Engagement**:
    - How can we target users based on their usage patterns, document categories, and engagement (e.g., weekend vs. weekday)?
    - What features correlate with high customer satisfaction or retention?

---

### EDA Steps

To achieve these objectives, we will proceed with the following steps:
1. **Data Overview**:
    - Check the structure of the dataset.
    - Inspect basic statistics (mean, median, min, max) and data types.

2. **Missing Values and Duplicates**:
    - Check for missing data and decide on the strategy (e.g., imputation or removal).
    - Remove any duplicate records to ensure clean analysis.

3. **Univariate Analysis**:
    - Analyze individual features (e.g., subscription type, usage on weekdays, or mobile vs. desktop usage).
    - Visualize distributions of numerical features and counts of categorical features.

4. **Bivariate and Multivariate Analysis**:
    - Investigate relationships between pairs of variables (e.g., product usage vs. subscription type).
    - Create scatter plots, box plots, and correlation matrices to identify patterns.

5. **Feature Correlation**:
    - Examine how features are correlated with each other (e.g., the relationship between different usage types).
    - Identify any potential multicollinearity that may impact clustering or segmentation.

6. **Customer Segmentation Insights**:
    - Investigate if distinct clusters or segments emerge from the data (e.g., heavy product users vs. occasional users).
    - Visualize clusters to better understand different customer groups.
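
The first steps above can be sketched on a small frame. This is a minimal illustration, not the notebook's actual pipeline: the `toy` DataFrame and its values are made up, and only the column names are borrowed from the dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset; column names mirror
# the notebook, values are invented for illustration.
toy = pd.DataFrame({
    "sub_type": ["Creative Cloud", "Photoshop", "Photoshop", None],
    "ps_weekend_usage": [0.2, 0.5, 0.5, 0.1],
    "mobile_usage": [1.0, 0.0, 0.0, 3.0],
})

# Step 1: structure and basic statistics
print(toy.dtypes)
print(toy.describe())

# Step 2: missing values per column and fully duplicated rows
missing_per_column = toy.isna().sum()
n_duplicates = int(toy.duplicated().sum())
deduplicated = toy.drop_duplicates()

# Step 5: pairwise correlation between the numeric features
corr = toy.select_dtypes(include="number").corr()
print(corr)
```

On the real data the same calls would be applied to the loaded DataFrame; whether duplicates should be dropped depends on whether identical rows are true repeats or distinct users with identical values.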

---

### Justification for EDA Approach

1. **Identifying Key Customer Segments**:
    - Understanding customer segmentation through features like **subscription type**, **usage patterns**, and **product preferences** helps Adobe tailor marketing campaigns effectively. We can identify the needs and behaviors of high-value customers and build more targeted strategies.
   
2. **Data Cleaning and Preprocessing**:
    - The integrity of the data is vital for accurate modeling. EDA allows us to **identify missing values**, **outliers**, and **incorrect data entries**. Cleaning and preprocessing data before modeling is essential for ensuring reliable outcomes.

3. **Insight Generation**:
    - EDA is a valuable step to generate insights that can guide our clustering and segmentation efforts. By looking at patterns in customer usage data, we can better understand which factors are most influential in customer behavior and purchasing decisions.

4. **Informing Future Steps**:
    - By conducting EDA, we build a solid understanding of the data, which will inform future analysis steps such as **predictive modeling**, **customer segmentation**, and **marketing strategies**.

---

### Conclusion

This EDA will be a crucial first step toward understanding customer behavior and finding meaningful insights that can drive the next phase of analysis. The goal is to ensure that the **data is clean**, the **right features** are identified, and any **patterns or trends** are uncovered to help Adobe build a more effective marketing and engagement strategy.

Let’s begin with the first step of the EDA process: inspecting the data and understanding its structure!


In [53]:
import pandas as pd
df = pd.read_csv('C:/Users/mayuo/OneDrive/Documents/Machine Learning by Abraham/interview_take_home/data/adobe_df.csv')
df.head(3)

Unnamed: 0,index,ps_cluster,market_segment,sub_type,ps_weekday_working_usage,ps_weekday_nonworking_usage,ps_weekend_usage,sv_job_collated,sv_skill_collated,sv_purpose_collated,...,tb_tot_no_activities,tb_activity_on_after_effects,tb_activity_on_bridge,tb_activity_on_illustrator,tb_activity_on_indesign,tb_activity_on_lightroom,tb_activity_on_media_encoder,tb_activity_on_photoshop,tb_activity_on_premiere_pro,tb_activity_on_adobe_xd
0,0,Photo Enthusiast,COMMERCIAL,Phtoshp Lightrm Bndl,0.621516,0.0,0.378484,hobbyist,all_three_skill_levels,me_nonprofessional,...,0.18232,0.0,0.033149,0.0,0.0,0.022099,0.0,0.127072,0.0,0.0
1,1,Independent Photo Pro,COMMERCIAL,Phtoshp Lightrm Bndl,0.719328,0.014286,0.266387,-1,-1,-1,...,0.237569,0.0,0.0,0.0,0.0,0.198895,0.0,0.038674,0.0,0.0
2,2,Independent Photo Pro,COMMERCIAL,Phtoshp Lightrm Bndl,0.566488,0.244484,0.189028,-1,experienced,-1,...,0.651934,0.0,0.0,0.0,0.0,0.563536,0.0,0.088398,0.0,0.0


# Adobe Customer Segmentation Project - Feature Justifications

In this project, we aim to perform customer segmentation based on Adobe product usage data. By analyzing user behavior, preferences, and hardware configurations, we can create meaningful customer profiles. These profiles will assist in **marketing strategies**, **product recommendations**, and help Adobe optimize its product offerings for different user segments.

Below is a detailed explanation of the features we've selected for the analysis, along with the rationale behind including each of them. These features have been carefully chosen to provide a comprehensive view of Adobe users' behavior, their workflows, and the hardware they use, all of which will play a key role in developing actionable insights.

---

### **1. `ps_cluster`**

- **Justification**: This feature represents the **existing user clusters** in the dataset. By incorporating this, we can evaluate how well the clusters align with our new segmentation model, and refine or validate our approach accordingly. Understanding the relationship between our new and existing clusters will allow for better targeting of marketing efforts.
- **Assumption**: **Clustering is already part of the dataset**, and we can use this as a **benchmark** to compare with new clusters we derive from the features.
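
One simple way to benchmark new clusters against `ps_cluster` is a row-normalized cross-tabulation. The sketch below assumes a hypothetical `new_cluster` column produced by a later clustering step; the labels are invented for illustration.

```python
import pandas as pd

# `new_cluster` is a hypothetical output of a later clustering step.
labels = pd.DataFrame({
    "ps_cluster": [
        "Photo Enthusiast", "Photo Enthusiast", "Independent Photo Pro",
        "Interactive Designer", "Independent Photo Pro",
    ],
    "new_cluster": [0, 0, 1, 2, 1],
})

# Rows: existing segments; columns: new clusters; values: share of each
# existing segment that lands in each new cluster (rows sum to 1).
overlap = pd.crosstab(labels["ps_cluster"], labels["new_cluster"], normalize="index")
print(overlap)
```

Rows dominated by a single column suggest the new clusters reproduce the existing segment; rows spread across columns suggest the new model splits it differently.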

---

### **2. Subscription Information**

- **Features**: `market_segment`, `sub_type`
- **Justification**: These features provide crucial insights into the **type of subscription** a user has, and the **market segment** they belong to (in this dataset, `COMMERCIAL` or `EDUCATION`). Knowing this helps differentiate between **professional users** and **hobbyist users**, guiding targeted **sales strategies** and **product offerings**.
- **Assumption**: **Users in higher-tier plans** may need more advanced tools, while **lower-tier users** may focus on basic features. Subscription info is also key to identifying opportunities for **upselling**.

---

### **3. Photoshop Usage (Weekday vs Weekend)**

- **Features**: `ps_weekday_working_usage`, `ps_weekday_nonworking_usage`, `ps_weekend_usage`
- **Justification**: The usage patterns during the **weekdays** and **weekends** reveal when users are most active. This can indicate if they are primarily **working professionals** or **casual users**. Understanding these behaviors is vital for targeting users with **time-specific promotions** or tailoring product features accordingly.
- **Assumption**: **Weekday users** are likely more **professionally engaged**, while **weekend users** might be pursuing **hobbyist or creative tasks**.
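
As a small sketch of how these three columns could be used, the code below takes the values from the three rows shown in the `df.head(3)` output, checks that they behave like shares of total usage, and labels each user by the period with the largest share. That the columns sum to 1 for every user is an assumption based only on those three rows.

```python
import pandas as pd

usage_cols = [
    "ps_weekday_working_usage",
    "ps_weekday_nonworking_usage",
    "ps_weekend_usage",
]

# Values copied from the three rows shown in df.head(3).
usage = pd.DataFrame(
    [
        [0.621516, 0.0, 0.378484],
        [0.719328, 0.014286, 0.266387],
        [0.566488, 0.244484, 0.189028],
    ],
    columns=usage_cols,
)

# If the columns are shares of total Photoshop usage, each row sums to about 1.
row_totals = usage[usage_cols].sum(axis=1)

# Label each user with the period that has the largest share.
dominant_period = usage[usage_cols].idxmax(axis=1)
print(pd.concat([row_totals.rename("total"), dominant_period.rename("dominant")], axis=1))
```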

---

### **4. Survey Responses (Job, Skill, Purpose)**

- **Features**: `sv_job_collated`, `sv_skill_collated`, `sv_purpose_collated`
- **Justification**: These features offer insights into the **user’s job role**, **skill level**, and **purpose** for using Adobe products. Whether users are beginners, experts, or professionals directly impacts the **tools** and **features** they need. This helps identify **targeted content** or **training materials** and can influence **product development**.
- **Assumption**: **High-skill users** may require **advanced features**, while **low-skill users** may prefer **user-friendly tools**.

---

### **5. Photoshop File Information**

- **Features**: `ps_doc_average_file_size`, `ps_doc_average_openseconds`
- **Justification**: The **average file size** and **open seconds** give insights into how **complex** users’ Photoshop files are. Larger files or longer open times suggest **advanced usage**—likely professionals working on **high-resolution graphics** or multi-layered designs.
- **Assumption**: **Advanced users** tend to work with more **complex files** that require more powerful software capabilities, while **basic users** might work with lighter, simpler files.

---

### **6. Email Type**

- **Feature**: `generic_email`
- **Justification**: The **email type** can provide hints on whether the user is from a **professional or educational domain** (e.g., `.edu` or `.org`) or a **general user** (e.g., Gmail, Yahoo). Users with domain-specific emails might have different product needs or engagement levels.
- **Assumption**: Users with **professional or educational emails** might be more likely to engage with **enterprise-level products** or **business-focused features**.

---

### **7. Mobile vs Web/Desktop Usage**

- **Features**: `mobile_usage`, `web_usage`
- **Justification**: Users' device preferences—whether they use Adobe tools on **mobile**, **desktop**, or via **web applications**—can reveal their engagement level and needs. Desktop users are more likely to perform **resource-heavy tasks**, while mobile users might prefer quick, on-the-go functionality.
- **Assumption**: **Mobile users** may need **simplified tools** with a focus on convenience, while **desktop users** are likely more **intensive and advanced** in their usage.

---

### **8. Camera Used (if Available)**

- **Features**: `count_camera_make`, `count_camera_model`
- **Justification**: The camera make and model provide insights into the **type of content** users work with, especially for **photographers** and **videographers**. **Professional-grade cameras** (e.g., DSLR or mirrorless) may indicate a need for **advanced editing tools** such as **RAW file processing**.
- **Assumption**: **High-end cameras** suggest a **professional user** who might engage in **advanced photo and video editing**, potentially indicating a need for **specialized Adobe tools**.

---

### **9. Photoshop Document Categories**

- **Features**: `ps_doc_category_*` (e.g., watercolor, photo, typography)
- **Justification**: These features show the **type of documents** users work on, ranging from **artistic styles** (e.g., watercolor, vector art) to more **professional use cases** (e.g., banners, 3D models). Knowing this helps tailor **feature development** and **marketing strategies** to specific content creation needs.
- **Assumption**: Users working on **specific document categories** may require **specialized tools**. For example, **3D designers** need more **3D features** and **high-end rendering tools**.

---

### **10. Machine Information**

- **Features**: `machine_ps_max_memory`, `machine_ps_max_speed`, `machine_ps_max_numprocessors`, `operating_system`
- **Justification**: The hardware specifications, such as **memory** and **processors**, help understand whether the user is a **professional** with a **high-performance machine** or a **casual user** with less powerful equipment. This distinction can help in determining whether the user needs **advanced or lightweight tools**.
- **Assumption**: Users with **high-performance machines** likely need **powerful tools** to match their hardware capabilities. This might indicate **enterprise users** or **creative professionals**.

---

### **11. Adobe Products Used**

- **Features**: `most_used_products`, `num_used_products`, `tb_activity_on_*` (various Adobe tools like Photoshop, Illustrator, Premiere Pro)
- **Justification**: Understanding which Adobe products users engage with most often and how frequently they use them provides deep insights into their **workflows**. This is useful for identifying potential **cross-selling** opportunities (e.g., selling **Illustrator** to heavy **Photoshop** users).
- **Assumption**: Users who engage with multiple Adobe products are more likely to be **professional creatives** who need **bundled solutions** or **premium features**.
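
The `tb_activity_on_*` columns can be summarized per user, for example by counting products with any activity and picking the product with the highest value. The sketch uses a subset of columns with values from the `df.head(3)` output; whether this count matches the dataset's own `num_used_products` column is not known and is not assumed here.

```python
import pandas as pd

# Subset of activity columns, values taken from the df.head(3) output.
activity = pd.DataFrame({
    "tb_activity_on_bridge": [0.033149, 0.0, 0.0],
    "tb_activity_on_illustrator": [0.0, 0.0, 0.0],
    "tb_activity_on_lightroom": [0.022099, 0.198895, 0.563536],
    "tb_activity_on_photoshop": [0.127072, 0.038674, 0.088398],
})

product_cols = [c for c in activity.columns if c.startswith("tb_activity_on_")]

# Number of products with any recorded activity per user.
n_active_products = (activity[product_cols] > 0).sum(axis=1)

# Product with the highest activity value per user, prefix stripped.
top_product = (
    activity[product_cols]
    .idxmax(axis=1)
    .str.replace("tb_activity_on_", "", regex=False)
)
print(pd.concat([n_active_products.rename("n_active"), top_product.rename("top")], axis=1))
```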

---

### **Conclusion**

The selected features cover a wide spectrum of user behavior and preferences, providing key insights into both **user intent** and **usage patterns**. By analyzing these features, we can effectively segment users and make **data-driven marketing decisions**, **product recommendations**, and **feature enhancements** based on user needs. This holistic view allows for **more personalized and effective strategies** for engaging Adobe’s diverse customer base.


In [54]:
selected_features = [
    'ps_cluster', 'market_segment', 'sub_type', 'ps_weekday_working_usage', 
    'ps_weekday_nonworking_usage', 'ps_weekend_usage', 'sv_job_collated',
    'sv_skill_collated', 'sv_purpose_collated', 'ps_doc_average_file_size','ps_doc_average_openseconds', 'generic_email', 
    'mobile_usage', 'web_usage', 'count_camera_make', 'count_camera_model', 
    'ps_doc_category_watercolor', 'ps_doc_category_vec_art', 'ps_doc_category_painting',
    'ps_doc_category_photo', 'ps_doc_category_typography', 'ps_doc_category_poster', 
    'ps_doc_category_sketch', 'ps_doc_category_pattern_texture', 'ps_doc_category_meme',
    'ps_doc_category_adv_banner', 'ps_doc_category_screenshot', 'ps_doc_category_3d',
    'total_photo_usage', 'total_design_usage', 'total_illustration_usage', 'total_video_usage', 
    'total_3d_usage', 'machine_ps_max_memory', 'machine_ps_max_speed', 
    'machine_ps_max_numprocessors', 'operating_system', 'most_used_products', 'num_used_products', 
    'tb_activity_on_acrobat', 'tb_tot_no_activities', 'tb_activity_on_after_effects', 'tb_activity_on_bridge',
    'tb_activity_on_illustrator', 'tb_activity_on_indesign', 'tb_activity_on_lightroom', 'tb_activity_on_media_encoder',
    'tb_activity_on_photoshop', 'tb_activity_on_premiere_pro', 'tb_activity_on_adobe_xd'
]

df_selected = df[selected_features]
df_selected.shape

(12249, 50)

In [55]:
df_selected.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12249 entries, 0 to 12248
Data columns (total 50 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   ps_cluster                       12249 non-null  object 
 1   market_segment                   12249 non-null  object 
 2   sub_type                         12249 non-null  object 
 3   ps_weekday_working_usage         12249 non-null  float64
 4   ps_weekday_nonworking_usage      12249 non-null  float64
 5   ps_weekend_usage                 12249 non-null  float64
 6   sv_job_collated                  12249 non-null  object 
 7   sv_skill_collated                12249 non-null  object 
 8   sv_purpose_collated              12249 non-null  object 
 9   ps_doc_average_file_size         12249 non-null  float64
 10  ps_doc_average_openseconds       12249 non-null  float64
 11  generic_email                    12249 non-null  bool   
 12  mobile_usage      

In [56]:
# Extract object (string) columns from the selected features
object_columns = df_selected.select_dtypes(include=['object'])

# Print the value counts of each object column
for col in object_columns.columns:
    print(f"Value counts for {col}:\n")
    print(df_selected[col].value_counts())
    print("\n" + "-"*50 + "\n")

Value counts for ps_cluster:

ps_cluster
Photo Enthusiast                3585
Next Generation Creative        2920
Traditional Graphic Designer    2275
Independent Photo Pro           1899
Interactive Designer            1570
Name: count, dtype: int64

--------------------------------------------------

Value counts for market_segment:

market_segment
COMMERCIAL    10633
EDUCATION      1616
Name: count, dtype: int64

--------------------------------------------------

Value counts for sub_type:

sub_type
Phtoshp Lightrm Bndl    4747
Creative Cloud Indiv    4237
Creative Cloud          1883
Photoshop                536
Acrobat Pro Subs CC      275
CC Photography New       253
Illustrator              113
Adobe Premiere Pro        75
InDesign                  59
Lightroom CC              23
Adobe Premiere RUSH       12
Dreamweaver               10
After Effects              6
Acrobat Professional       6
CCI & Stock                3
Adobe XD                   2
Photoshop Express         

My thought process going into this:
- Group the smaller subscription types into "Other": too many categories create sparse groups that could distort the segmentation and analysis.
- For skill level, `-1` is used as a placeholder, so I can impute it, label it as unknown, or drop the column. I also need to map the concatenated skill values so that only three levels remain.
- For purpose, there are combinations of values that aren't reasonable. I would have to map them into something simpler.
- For `machine_ps_max_memory`, I can convert it to numeric and then group it into about five categories (low, medium, high, etc.).
- For `machine_ps_max_speed`, I can bin it after converting to numeric, or just leave it as numeric.
- For processors, I can convert to numeric and bin, or leave it as is.
- For used products, the less-used ones can be binned into "Other".
- Because of the collated categories, I think I can add a "no response" category.

# Data Preprocessing Justification

## Why We're Removing Specific Columns:

In this project, we are working with a rich dataset containing various features related to Adobe product usage and user activities. However, certain features require special handling due to their high volume of missing or placeholder values, which can potentially skew our analysis and introduce noise. 

We have decided to remove the three survey columns listed below, for the reasons that follow:

### 1. **sv_job_collated, sv_skill_collated, sv_purpose_collated:**
These columns contain a high proportion of placeholder values ("-1"), indicating a lack of response or unavailable data. Since these placeholders represent non-responses rather than actual values, they would likely introduce bias if left in the dataset. 

- **sv_job_collated**: The job role of the user, with categories like "hobbyist" and "photographer," but also with a large number of missing or placeholder values.
- **sv_skill_collated**: User skill levels, including categories like "beginner," "intermediate," and "experienced," but again, a high number of "-1" values, which likely represent missing data.
- **sv_purpose_collated**: The purpose behind using the Adobe products (professional, non-professional, or organizational use), but with a complex structure that includes placeholders and many possible combinations.

Given that these columns have many missing or irrelevant values, keeping them could add unnecessary complexity. Instead, we believe it's more beneficial to focus on other features like actual usage metrics (e.g., `tb_activity_on_photoshop`, `tb_tot_no_activities`) to infer skills or purposes based on behavior.
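
Before dropping the columns, the share of `-1` placeholders can be quantified per column. The sketch below uses only the three rows visible in the `df.head(3)` output, so the resulting shares are illustrative and say nothing about the proportions in the full dataset.

```python
import pandas as pd

survey_cols = ["sv_job_collated", "sv_skill_collated", "sv_purpose_collated"]

# Values from the three rows shown in df.head(3); not representative of the
# full dataset.
survey = pd.DataFrame({
    "sv_job_collated": ["hobbyist", "-1", "-1"],
    "sv_skill_collated": ["all_three_skill_levels", "-1", "experienced"],
    "sv_purpose_collated": ["me_nonprofessional", "-1", "-1"],
})

# Fraction of rows per column that hold the "-1" placeholder.
# astype(str) guards against the placeholder being stored as a number.
placeholder_share = survey[survey_cols].astype(str).eq("-1").mean()
print(placeholder_share)
```

Running the same expression on `df_selected[survey_cols]` would give the actual shares and make the "high proportion" claim checkable.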

### 2. **Activity-based Skill Inference:**
Rather than relying on the potentially inaccurate or incomplete data in `sv_skill_collated` or `sv_purpose_collated`, we can infer user behavior and skills from actual product usage data (e.g., time spent on each product, number of activities performed). This will allow us to create more robust skill and purpose classifications based on real actions taken by users, rather than relying on incomplete survey data.

### 3. **Separation for Backup:**
We will store the removed columns in a separate DataFrame (`df_removed`) as a backup. This allows us to keep these columns for potential future analysis or imputation strategies if needed, without losing any data. This separation also provides us the flexibility to revisit these features after we test and refine our models.

### Conclusion:
Removing these columns will reduce noise and simplify the dataset. The focus will shift towards more reliable features that represent actual user behavior, allowing us to create more meaningful user segments and better understand the value drivers for different customer segments. We may revisit these features later if we choose to experiment with imputation or other methods for handling missing data.


In [57]:
# Survey columns dominated by the "-1" placeholder
survey_cols = ['sv_job_collated', 'sv_skill_collated', 'sv_purpose_collated']

# Keep the removed columns as a backup, then drop them from the working frame
df_removed = df_selected[survey_cols].copy()
df_no_col = df_selected.drop(columns=survey_cols)

# Check the shapes to confirm
print(f"Original DataFrame shape: {df.shape}")
print(f"Selected-features DataFrame shape: {df_selected.shape}")
print(f"DataFrame after dropping survey columns shape: {df_no_col.shape}")

Original DataFrame shape: (12249, 60)
Selected-features DataFrame shape: (12249, 50)
DataFrame after dropping survey columns shape: (12249, 47)


### Grouping Small Categories into "Other" - Justification

To simplify the analysis and improve the interpretability of the data, we group categories that each represent less than 10% of the rows into a single "Other" category (labeled `Others` in the code). This helps in the following ways:

1. **Focus on Major Categories**: By consolidating smaller categories, we prioritize the most significant data.
2. **Reduce Noise**: Low-frequency categories can introduce noise, which may not be useful for modeling or analysis.
3. **Improve Manageability**: Fewer unique categories make the data easier to handle and visualize.
4. **Maintain Interpretability**: Simplifies the results, making the dataset more interpretable while preserving meaningful information.

By applying a 10% threshold, we ensure that the most frequent categories remain distinct, while less common ones are grouped together.


In [58]:
sub_type_pro = df_no_col['sub_type'].value_counts(normalize=True)
sub_type_pro

sub_type
Phtoshp Lightrm Bndl    0.387542
Creative Cloud Indiv    0.345906
Creative Cloud          0.153727
Photoshop               0.043759
Acrobat Pro Subs CC     0.022451
CC Photography New      0.020655
Illustrator             0.009225
Adobe Premiere Pro      0.006123
InDesign                0.004817
Lightroom CC            0.001878
Adobe Premiere RUSH     0.000980
Dreamweaver             0.000816
After Effects           0.000490
Acrobat Professional    0.000490
CCI & Stock             0.000245
Adobe XD                0.000163
Photoshop Express       0.000163
Adobe Spark             0.000163
CCT & Stock             0.000082
Animate / Flash Pro     0.000082
Lightroom w Classic     0.000082
Adobe InCopy            0.000082
CCPP + Stock            0.000082
Name: proportion, dtype: float64

In [59]:
threshold = 0.10 
small_cat = sub_type_pro[sub_type_pro < threshold].index

df_no_col['sub_type'] = df_no_col['sub_type'].replace(small_cat, 'Others')
df_no_col['sub_type'].value_counts()

sub_type
Phtoshp Lightrm Bndl    4747
Creative Cloud Indiv    4237
Creative Cloud          1883
Others                  1382
Name: count, dtype: int64

### Handling 'machine_ps_max_memory' Column

The `machine_ps_max_memory` column contains both valid and invalid values (e.g., `-1`, `NoValue`) that need to be addressed before proceeding with analysis. The steps taken are:

1. **Conversion to Numeric**: We converted the `machine_ps_max_memory` values to numeric data types to ensure consistency and facilitate downstream analysis. Non-numeric values, such as invalid strings, are coerced to `NaN`.

2. **Handling Invalid Values**: We replaced `-1` and `NoValue` with `NaN` to eliminate invalid entries from the analysis. These values were likely used as placeholders or for missing data.

3. **Removing Rows with NaN Values**: After replacing invalid values, rows containing `NaN` were dropped, ensuring that only valid memory values remain for categorization.

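4. **Binning Memory Values**: The values appear to be recorded in MB (e.g., `8192` for 8 GB). Memory sizes were binned into four left-closed categories using the edges 8000, 16000, and 32000:
   - **Low**: below 8000 MB (machines with less than 8 GB)
   - **Medium**: 8000 to 15999 MB (includes 8 GB and 12 GB machines)
   - **High**: 16000 to 31999 MB (includes 16 GB and 24 GB machines)
   - **Very High**: 32000 MB and above (32 GB and more)

   Because the edges sit slightly below the nominal sizes, a machine reporting exactly 8192 MB falls into **Medium**, not **Low**. Values of `0` are not treated as invalid in this step and fall into **Low**. These categories were chosen to group similar memory capacities together, making it easier to understand and compare machine capabilities at a high level.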

By following this approach, we ensure that the `machine_ps_max_memory` data is clean, consistent, and categorized in a way that supports further analysis and modeling.
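
Where a common size such as 8192 MB lands depends on the bin edges and on which side of each interval is closed. The sketch below compares the edges used in this notebook's code with a hypothetical alternative aligned to powers of two; the alternative is an assumption for illustration, not the method used in the analysis.

```python
import pandas as pd

# Common memory sizes, assumed to be in MB as in the value counts.
memory_mb = pd.Series([4096, 8192, 16384, 32768, 65536])
labels = ["Low", "Medium", "High", "Very High"]

# Edges used in the notebook: left-closed intervals [0, 8000), [8000, 16000), ...
notebook_bins = pd.cut(
    memory_mb,
    bins=[0, 8000, 16000, 32000, float("inf")],
    labels=labels,
    right=False,
)

# Hypothetical alternative: right-closed intervals aligned to 8/16/32 GB,
# so that exactly 8192 MB counts as Low.
gb_aligned_bins = pd.cut(
    memory_mb,
    bins=[0, 8192, 16384, 32768, float("inf")],
    labels=labels,
    right=True,
    include_lowest=True,
)

print(pd.DataFrame({
    "memory_mb": memory_mb,
    "notebook": notebook_bins,
    "gb_aligned": gb_aligned_bins,
}))
```

With the notebook's edges, 8 GB, 16 GB, and 32 GB machines are labeled Medium, High, and Very High respectively; with the aligned edges they shift one category down.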


In [60]:
df_no_col['machine_ps_max_memory'].value_counts()

machine_ps_max_memory
8192       4151
16384      4035
32768      2024
65536       574
4096        410
12288       315
24576       239
6144        119
40960       119
49152        59
0            29
98304        28
20480        26
-1           21
131072       15
28672         9
65534         8
10240         7
73728         7
2048          6
9216          4
3072          3
81920         3
17408         2
18432         2
14336         2
393216        2
57344         2
5120          2
NoValue       2
61440         2
7758          1
15998         1
53248         1
7987          1
32767         1
16322         1
8097          1
196608        1
45056         1
32772         1
32718         1
8030          1
36864         1
262144        1
32775         1
16400         1
8084          1
8189          1
7113          1
16376         1
6031          1
30720         1
Name: count, dtype: int64

In [61]:
# Coerce to numeric; non-numeric placeholders such as 'NoValue' become NaN
df_no_col['machine_ps_max_memory'] = pd.to_numeric(df_no_col['machine_ps_max_memory'], errors='coerce')

# Treat the -1 placeholder as missing too, then drop rows without a valid value
df_no_col['machine_ps_max_memory'] = df_no_col['machine_ps_max_memory'].replace(-1, float('nan'))
df_no_col = df_no_col.dropna(subset=['machine_ps_max_memory'])

df_no_col['machine_ps_max_memory'].value_counts()

machine_ps_max_memory
8192.0      4151
16384.0     4035
32768.0     2024
65536.0      574
4096.0       410
12288.0      315
24576.0      239
6144.0       119
40960.0      119
49152.0       59
0.0           29
98304.0       28
20480.0       26
131072.0      15
28672.0        9
65534.0        8
10240.0        7
73728.0        7
2048.0         6
9216.0         4
3072.0         3
81920.0        3
17408.0        2
14336.0        2
18432.0        2
393216.0       2
57344.0        2
5120.0         2
61440.0        2
7758.0         1
7987.0         1
15998.0        1
53248.0        1
32767.0        1
32718.0        1
16322.0        1
8097.0         1
196608.0       1
45056.0        1
36864.0        1
8030.0         1
32772.0        1
262144.0       1
32775.0        1
16400.0        1
8084.0         1
8189.0         1
7113.0         1
16376.0        1
6031.0         1
30720.0        1
Name: count, dtype: int64

In [62]:
bins = [0, 8000, 16000, 32000, float('inf')]  # Bin edges (values appear to be in MB)
labels = ['Low', 'Medium', 'High', 'Very High']  # Category labels
# right=False makes the intervals left-closed: [0, 8000), [8000, 16000), ...
df_no_col['memory_category'] = pd.cut(df_no_col['machine_ps_max_memory'], bins=bins, labels=labels, right=False)

df_no_col['memory_category'].value_counts()

memory_category
Medium       4484
High         4317
Very High    2852
Low           573
Name: count, dtype: int64

In [63]:
df_no_col.drop(columns=['machine_ps_max_memory'], inplace=True)
df_no_col.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12226 entries, 0 to 12248
Data columns (total 47 columns):
 #   Column                           Non-Null Count  Dtype   
---  ------                           --------------  -----   
 0   ps_cluster                       12226 non-null  object  
 1   market_segment                   12226 non-null  object  
 2   sub_type                         12226 non-null  object  
 3   ps_weekday_working_usage         12226 non-null  float64 
 4   ps_weekday_nonworking_usage      12226 non-null  float64 
 5   ps_weekend_usage                 12226 non-null  float64 
 6   ps_doc_average_file_size         12226 non-null  float64 
 7   ps_doc_average_openseconds       12226 non-null  float64 
 8   generic_email                    12226 non-null  bool    
 9   mobile_usage                     12226 non-null  float64 
 10  web_usage                        12226 non-null  float64 
 11  count_camera_make                12226 non-null  float64 
 12  count_cam

### Justification for Handling and Binning `machine_ps_max_speed` Feature:

#### 1. **Conversion to Numeric**:
   - **Why**: The original values of `machine_ps_max_speed` may include non-numeric entries or placeholders such as 'NoValue' or '0'. Converting the column to numeric ensures that we work with valid numerical data for analysis or model building. 
   - **How**: Using `pd.to_numeric()`, we handle potential errors by coercing invalid values to `NaN` which can then be filtered out.

#### 2. **Removing Invalid or Irrelevant Values**:
   - **Why**: Entries such as 'NoValue' and 0 are not meaningful for machine performance analysis. These need to be excluded to avoid distorting our understanding of machine capabilities.
   - **How**: We filter out rows containing these invalid values using the `notna()` and condition checks, ensuring the dataset reflects actual, relevant machine performance data.

#### 3. **Binning the Speed Values**:
   - **Why**: Raw continuous data, like machine speeds, can be difficult to interpret or analyze in its raw form. By grouping the speeds into bins (e.g., Low, Medium, High, Very High), we simplify the data into meaningful categories that can be more easily related to user activity or machine performance.
   - **How**: We define speed categories using specific numeric ranges, which can be adjusted based on the distribution of values or domain knowledge. This process reduces noise and makes it easier to analyze correlations with other features.
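
Fixed thresholds can leave some bins nearly empty, while quantile-based bins adapt to the distribution. The sketch below contrasts the two on made-up speeds; the values and the MHz unit are assumptions, and `pd.qcut` is shown as an alternative rather than the approach used in this notebook.

```python
import pandas as pd

# Hypothetical clock speeds (assumed MHz), invented for illustration.
speed = pd.Series([1800, 2400, 2600, 2900, 3200, 3400, 3600, 4000])
labels = ["Low", "Medium", "High", "Very High"]

# Fixed, left-closed thresholds: [0, 2500), [2500, 3500), [3500, 4500), [4500, inf)
fixed = pd.cut(
    speed,
    bins=[0, 2500, 3500, 4500, float("inf")],
    labels=labels,
    right=False,
)

# Quantile-based bins: each category receives roughly a quarter of the rows.
quantile = pd.qcut(speed, q=4, labels=labels)

print(fixed.value_counts())
print(quantile.value_counts())
```

Quantile bins keep the categories balanced but their edges change with the data, so the labels are relative ("fastest quarter") rather than absolute.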

#### 4. **Dropping the Original Column**:
   - **Why**: After binning, the original `machine_ps_max_speed` column is redundant and could lead to confusion in the analysis or modeling process. Removing it ensures the dataset is clean and that we're only working with the new categorical feature, which provides the necessary context.
   - **How**: We use `df.drop()` to remove the original column from the dataframe.

#### Overall Benefit:
   - **Improved Interpretability**: Transforming raw numeric values into categories makes machine capabilities, and their correlation with other features, easier to understand.
   - **Model Performance**: Some machine learning algorithms can be more robust with binned features than with continuous ones, especially when the data contains outliers or a wide range of values.
   - **Consistency**: This approach provides consistent, categorized data, making it easier to analyze patterns in user behavior, machine usage, and product interaction.

This approach ensures that we are working with relevant, clean, and interpretable data, which can improve both exploratory data analysis and subsequent modeling.
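
The four steps above can be tried end to end on a handful of made-up values. The sketch below is illustrative only: the `toy` frame and its values are assumptions, while the column name, bin edges, and labels are the ones used in this notebook.

```python
import pandas as pd

# Made-up sample (assumption): valid speeds, the 'NoValue' placeholder, and a zero.
toy = pd.DataFrame({"machine_ps_max_speed": ["2400", "NoValue", "0", "3500", "4700"]})

# Step 1: coerce to numeric; 'NoValue' becomes NaN.
toy["machine_ps_max_speed"] = pd.to_numeric(toy["machine_ps_max_speed"], errors="coerce")

# Step 2: keep rows that are neither NaN nor 0.
valid = toy["machine_ps_max_speed"].notna() & (toy["machine_ps_max_speed"] != 0)
toy = toy[valid].copy()

# Step 3: bin with left-closed intervals, so 3500 falls in [3500, 4500).
bins = [0, 2500, 3500, 4500, float("inf")]
labels = ["Low", "Medium", "High", "Very High"]
toy["machine_ps_speed_bin"] = pd.cut(
    toy["machine_ps_max_speed"], bins=bins, labels=labels, right=False
)

# Step 4: drop the raw column once the binned feature exists.
toy = toy.drop(columns=["machine_ps_max_speed"])

print(toy["machine_ps_speed_bin"].tolist())  # ['Low', 'High', 'Very High']
```

Taking a `.copy()` after filtering is a design choice that avoids assigning into a view of the original frame.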


In [64]:
# Convert machine_ps_max_speed to numeric (if not already)
df_no_col['machine_ps_max_speed'] = pd.to_numeric(df_no_col['machine_ps_max_speed'], errors='coerce')

# Remove rows that were 'NoValue' (now NaN after coercion) or 0, as they are considered invalid
df_no_col = df_no_col[df_no_col['machine_ps_max_speed'].notna() & (df_no_col['machine_ps_max_speed'] != 0)]

In [65]:
# Define speed bins using fixed thresholds (example values; percentile-based edges are an alternative)
bins = [0, 2500, 3500, 4500, float('inf')]
labels = ['Low', 'Medium', 'High', 'Very High']

In [66]:
# Create a new column for binned speeds
df_no_col['machine_ps_speed_bin'] = pd.cut(df_no_col['machine_ps_max_speed'], bins=bins, labels=labels, right=False)

# Drop the original 'machine_ps_max_speed' column after binning
df_no_col.drop(columns=['machine_ps_max_speed'], inplace=True)

# Show the result of binning
df_no_col['machine_ps_speed_bin'].value_counts()

machine_ps_speed_bin
Medium       6536
High         3029
Low          2658
Very High       3
Name: count, dtype: int64
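
With the fixed thresholds, the `Very High` bin holds only 3 rows. One possible alternative is to derive the edges from percentiles instead. The following is a minimal sketch using `pd.qcut` on synthetic speeds; the random data and the choice of four equal-frequency bins are assumptions for illustration, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic speeds (assumption): 1000 values drawn around 3000.
rng = np.random.default_rng(0)
speeds = pd.Series(rng.normal(loc=3000, scale=500, size=1000))

labels = ["Low", "Medium", "High", "Very High"]

# q=4 splits the data at the 25th, 50th, and 75th percentiles;
# retbins=True also returns the computed bin edges.
speed_bin, edges = pd.qcut(speeds, q=4, labels=labels, retbins=True)

print(speed_bin.value_counts().sort_index())
print(edges)
```

Quantile bins are balanced by construction, but their edges depend on the data, so they are harder to relate to fixed hardware tiers than the hand-picked thresholds.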

In [67]:
# Delete DataFrames from earlier steps that are no longer needed
del df
del df_removed
del df_selected

#### Justification for `machine_ps_max_numprocessors` Binning:
We first convert `machine_ps_max_numprocessors` to numeric values, coercing any invalid entries to `NaN`, and drop the rows where the conversion failed. We then bin the processor counts into three groups: `Low`, `Medium`, and `High`. This reduces the complexity of the feature by grouping its values into manageable categories. Because the binning uses `right=False`, each interval includes its lower edge and excludes its upper edge, so the effective thresholds are:
- `Low`: 0-3 processors
- `Medium`: 4-11 processors
- `High`: 12 or more processors

The binned categories overwrite `machine_ps_max_numprocessors` in place, so no separate drop of the original column is needed, and the dataset stays clean and ready for modeling.
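
Which bin a value on an edge falls into depends on the `right` argument of `pd.cut`. The sketch below uses made-up processor counts (an assumption for illustration) with the same edges and labels as this notebook, and compares both settings.

```python
import pandas as pd

counts = pd.Series([2, 4, 8, 12, 16])  # made-up processor counts
bins = [0, 4, 12, float("inf")]
labels = ["Low", "Medium", "High"]

# right=False: intervals are [0, 4), [4, 12), [12, inf)
left_closed = pd.cut(counts, bins=bins, labels=labels, right=False)

# right=True: intervals are (0, 4], (4, 12], (12, inf]
right_closed = pd.cut(counts, bins=bins, labels=labels, right=True)

comparison = pd.DataFrame(
    {"processors": counts, "right=False": left_closed, "right=True": right_closed}
)
print(comparison)
```

With `right=False`, a machine with exactly 4 processors is `Medium` and one with exactly 12 is `High`; with `right=True`, they are `Low` and `Medium` respectively. Note that with `right=True` a value of 0 falls outside the first interval unless `include_lowest=True` is also passed.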


In [68]:
# Replace invalid values with NaN
df_no_col['machine_ps_max_numprocessors'] = pd.to_numeric(df_no_col['machine_ps_max_numprocessors'], errors='coerce')

# Drop rows where the conversion produced NaN
df_no_col = df_no_col[df_no_col['machine_ps_max_numprocessors'].notna()]


In [69]:
# Binning the machine_ps_max_numprocessors into categories
bins = [0, 4, 12, float('inf')]  # With right=False: [0, 4), [4, 12), [12, inf), i.e. 0-3, 4-11, 12+ processors
labels = ['Low', 'Medium', 'High']
df_no_col['machine_ps_max_numprocessors'] = pd.cut(df_no_col['machine_ps_max_numprocessors'], bins=bins, labels=labels, right=False)


df_no_col['machine_ps_max_numprocessors'].value_counts()

machine_ps_max_numprocessors
Medium    10466
High       1635
Low         125
Name: count, dtype: int64

#### Justification for `operating_system`:
The `operating_system` feature contains a placeholder value `-1`, which represents missing or invalid data. We remove these rows to maintain data integrity. 

Next, we transform the feature into a binary format to simplify it for modeling:
- `Mac` is encoded as `1`
- `Windows` is encoded as `0`
- `Both` is encoded as `1` (a mixed environment includes a Mac, so the resulting feature indicates whether the user works on a Mac at all).

This transformation reduces the complexity of the feature while still retaining its informative value for distinguishing between major operating systems.
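
The encoding cell itself does not appear in this part of the notebook, so the following is only a sketch of one way to apply the mapping described above. The `toy` frame and the `os_mapping` name are hypothetical; the placeholder is assumed to be stored as the string `'-1'`, as in the filtering cell below.

```python
import pandas as pd

# Made-up sample (assumption) containing every category plus the placeholder.
toy = pd.DataFrame({"operating_system": ["Mac", "Windows", "Both", "-1", "Mac"]})

# Drop the '-1' placeholder rows.
toy = toy[toy["operating_system"] != "-1"].copy()

# Binary encoding: Mac -> 1, Windows -> 0, Both -> 1.
os_mapping = {"Mac": 1, "Windows": 0, "Both": 1}
toy["operating_system"] = toy["operating_system"].map(os_mapping)

print(toy["operating_system"].tolist())  # [1, 0, 1, 1]
```

Using `.map()` with an explicit dictionary means any unexpected category would turn into `NaN` rather than being silently assigned a value, which makes leftover placeholders easy to spot.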


In [70]:
df_no_col = df_no_col[df_no_col['operating_system'] != '-1']
df_no_col['operating_system'].value_counts()

operating_system
Mac        5700
Windows    5637
Both        885
Name: count, dtype: int64

In [71]:
df_no_col.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12222 entries, 0 to 12248
Data columns (total 47 columns):
 #   Column                           Non-Null Count  Dtype   
---  ------                           --------------  -----   
 0   ps_cluster                       12222 non-null  object  
 1   market_segment                   12222 non-null  object  
 2   sub_type                         12222 non-null  object  
 3   ps_weekday_working_usage         12222 non-null  float64 
 4   ps_weekday_nonworking_usage      12222 non-null  float64 
 5   ps_weekend_usage                 12222 non-null  float64 
 6   ps_doc_average_file_size         12222 non-null  float64 
 7   ps_doc_average_openseconds       12222 non-null  float64 
 8   generic_email                    12222 non-null  bool    
 9   mobile_usage                     12222 non-null  float64 
 10  web_usage                        12222 non-null  float64 
 11  count_camera_make                12222 non-null  float64 
 12  count_cam

#### Justification for `most_used_products`:
The `most_used_products` feature represents the products most frequently used by the users. We first calculate the proportion of each product in the dataset to understand its distribution. 

We then group infrequent products under a single label: any product whose proportion falls below `threshold` is renamed "Others". In the counts shown below, only photoshop, lightroom, and acrobat keep their own label, so the cutoff applied in this run lies between illustrator's share (about 6.9%) and acrobat's (about 19.3%). A 1% cutoff would additionally keep illustrator, indesign, premiere_pro, and bridge.

Consolidating rare products into "Others" keeps the focus on the most significant product usage patterns and minimizes noise from low-frequency categories.


In [74]:
product_proportions = df_no_col['most_used_products'].value_counts(normalize=True)
product_proportions

most_used_products
photoshop             0.422599
lightroom             0.196858
acrobat               0.192849
illustrator           0.068810
indesign              0.041810
premiere_pro          0.031337
bridge                0.028146
after_effects         0.008100
adobe_xd              0.004091
dreamweaver           0.002127
audition              0.001555
media_encoder         0.000736
animate               0.000491
character_animator    0.000245
No usage              0.000164
incopy                0.000082
Name: proportion, dtype: float64

In [75]:
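# `threshold` is the proportion cutoff; it must be defined before this cell runs (its definition is not shown here)
rare_products = product_proportions[product_proportions < threshold].index

# Rename every product below the cutoff to 'Others'
df_no_col['most_used_products'] = df_no_col['most_used_products'].apply(lambda x: 'Others' if x in rare_products else x)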

df_no_col['most_used_products'].value_counts()

most_used_products
photoshop    5165
lightroom    2406
acrobat      2357
Others       2294
Name: count, dtype: int64
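
How many categories survive depends strongly on the cutoff. The small sketch below reuses the proportions printed above; `kept_products` is a hypothetical helper, and the two candidate cutoffs are chosen only for illustration.

```python
import pandas as pd

# Proportions copied from the value_counts(normalize=True) output above.
product_proportions = pd.Series({
    "photoshop": 0.422599, "lightroom": 0.196858, "acrobat": 0.192849,
    "illustrator": 0.068810, "indesign": 0.041810, "premiere_pro": 0.031337,
    "bridge": 0.028146, "after_effects": 0.008100, "adobe_xd": 0.004091,
    "dreamweaver": 0.002127, "audition": 0.001555, "media_encoder": 0.000736,
    "animate": 0.000491, "character_animator": 0.000245, "No usage": 0.000164,
    "incopy": 0.000082,
})


def kept_products(proportions, threshold):
    """Return the labels whose share is at least `threshold` (hypothetical helper)."""
    return proportions[proportions >= threshold].index.tolist()


for candidate in (0.01, 0.10):  # candidate cutoffs, chosen for illustration
    kept = kept_products(product_proportions, candidate)
    print(f"threshold={candidate:.2f}: keeps {len(kept)} products -> {kept}")
```

A 1% cutoff keeps seven products, while a 10% cutoff keeps only photoshop, lightroom, and acrobat, which matches the three named categories in the counts above.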

In [76]:
df_no_col.head(10)

Unnamed: 0,ps_cluster,market_segment,sub_type,ps_weekday_working_usage,ps_weekday_nonworking_usage,ps_weekend_usage,ps_doc_average_file_size,ps_doc_average_openseconds,generic_email,mobile_usage,...,tb_activity_on_bridge,tb_activity_on_illustrator,tb_activity_on_indesign,tb_activity_on_lightroom,tb_activity_on_media_encoder,tb_activity_on_photoshop,tb_activity_on_premiere_pro,tb_activity_on_adobe_xd,memory_category,machine_ps_speed_bin
0,Photo Enthusiast,COMMERCIAL,Phtoshp Lightrm Bndl,0.621516,0.0,0.378484,14.0,32.6875,True,0.0,...,0.033149,0.0,0.0,0.022099,0.0,0.127072,0.0,0.0,High,Medium
1,Independent Photo Pro,COMMERCIAL,Phtoshp Lightrm Bndl,0.719328,0.014286,0.266387,82.8,40.6,True,0.0,...,0.0,0.0,0.0,0.198895,0.0,0.038674,0.0,0.0,Medium,Low
2,Independent Photo Pro,COMMERCIAL,Phtoshp Lightrm Bndl,0.566488,0.244484,0.189028,12.5,1.1,True,-1.0,...,0.0,0.0,0.0,0.563536,0.0,0.088398,0.0,0.0,Very High,Medium
3,Next Generation Creative,COMMERCIAL,Creative Cloud,1.0,0.0,0.0,0.0,0.0,False,-1.0,...,0.0,0.01105,0.005525,0.0,0.0,0.038674,0.0,0.005525,High,Medium
4,Independent Photo Pro,COMMERCIAL,Creative Cloud,0.999827,5.6e-05,0.000117,49.728155,2.087379,False,0.0,...,0.160221,1.430939,0.022099,0.0,0.0,1.740331,0.0,1.01105,Very High,High
5,Interactive Designer,COMMERCIAL,Others,1.0,0.0,0.0,0.0,10.0,False,0.013889,...,0.0,0.0,0.0,0.0,0.0,0.016575,0.0,0.0,Medium,Low
6,Interactive Designer,EDUCATION,Creative Cloud Indiv,0.437382,0.556136,0.006481,27.857143,0.5,False,0.0,...,0.0,0.314917,0.276243,0.005525,0.033149,0.18232,0.110497,0.0,High,Medium
7,Photo Enthusiast,COMMERCIAL,Others,0.945882,0.054118,0.0,6.5,3.5,True,0.0,...,0.0,0.0,0.0,0.0,0.0,0.038674,0.0,0.0,Low,Low
8,Independent Photo Pro,COMMERCIAL,Creative Cloud,0.933805,0.029332,0.036862,16.153846,1.538462,False,0.0,...,0.0,0.453039,0.0,0.027624,0.0,0.243094,0.0,0.0,High,Medium
9,Next Generation Creative,EDUCATION,Creative Cloud Indiv,0.564964,0.154081,0.280955,0.5,0.5,False,0.0,...,0.0,0.30814,0.0,0.0,0.0,0.046512,0.005814,0.0,High,High
