<a href="https://colab.research.google.com/github/ayoosh226/Ecommerce-Data-Analysis/blob/main/code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd
import numpy as np
import re

In [None]:
flipkart= pd.read_csv('/content/smartphones - smartphones.csv')

In [128]:
flipkart.columns

Index(['model', 'price', 'rating', 'sim', 'processor', 'ram', 'battery',
       'display', 'camera', 'card', 'os'],
      dtype='object')

# Summary for data

This smartphone dataset contains 1,021 rows and 11 columns, providing detailed specifications of smartphones from multiple brands. It covers model, price, ratings, SIM and network support, processor details, RAM and storage, battery capacity, display specifications, camera configurations, memory card support, and operating system version, offering a structured snapshot of smartphone features and pricing.

# Column Descriptions

1. `model`:- Name of the smartphone, including brand and series/model.
2. `price`:- Price of the phone (in INR).
3. `rating`:- Overall customer rating of the phone (out of 100).
4. `sim`:- Information about SIM support, network types (3G/4G/5G), VoLTE, Wi-Fi, NFC, and connectivity options.
5. `processor`:- Details about the chipset, number of cores, and clock speed of the processor.
6. `ram`:- Information on RAM size and inbuilt storage (ROM).
7. `battery`:- Battery capacity (mAh), charging speed (watts), and fast-charging support.
8. `display`:- Display size, resolution, refresh rate, and type of notch/punch-hole.
9. `camera`:- Camera setup including rear and front cameras with megapixel details.
10. `card`:- Memory card support availability and maximum expandable storage.
11. `os`:- Operating system version of the device (e.g., Android v13, iOS v16).


# 📊 Data Assessing

---

## ✅ Quality Issues  
(problems with accuracy, completeness, validity, consistency)

1. **`model`** – brand names written in inconsistent formats *(consistency)*  
2. **`price`** – contains unnecessary **₹** symbol *(validity)*  
3. **`price`** – contains commas in numeric values *(validity)*  
4. **`price`** – outlier value (e.g., Namotel listed at ₹99) *(accuracy)*  
5. **`rating`** – missing values *(completeness)*  
6. **`processor`** – incorrect values for some Samsung phones (rows: 642, 647, 649, …) *(validity)*  
7. Non-phone device (`iPod`, row 756) present *(validity)*  
8. **`ram`** – incorrect values in multiple rows *(validity)*  
9. **`battery`** – incorrect or incomplete values in multiple rows *(validity)*  
10. **`display`** – sometimes refresh rate missing *(completeness)*  
11. **`display`** – incorrect values in multiple rows *(validity)*  
12. Foldable phones have scattered or inconsistent info *(validity)*  
13. **`camera`** – uses inconsistent words (“Dual”, “Triple”, “Quad”), front & rear separated by **&** *(consistency)*  
14. **`camera`** – incorrect values in many rows *(validity)*  
15. **`card`** – sometimes contains info about OS or Camera *(validity)*  
16. **`os`** – sometimes contains info about Bluetooth/FM radio *(validity)*  
17. **`os`** – incorrect values in rows (324, 378) *(validity)*  
18. **`os`** – version names written inconsistently (e.g., “Lollipop”) *(consistency)*  
19. Missing values in **camera, card, os** *(completeness)*  
20. **`price`** and **`rating`** stored as strings instead of numeric *(validity)*  

---

## 🧹 Tidiness Issues  
(structure/formatting problems – need splitting or restructuring)

1. **`sim`** → split into: `has_5G`, `has_NFC`, `has_IR_Blaster`  
2. **`ram`** → split into: `RAM`, `ROM`  
3. **`processor`** → split into: `processor_name`, `cores`, `CPU_speed`  
4. **`battery`** → split into: `capacity`, `fast_charging_available`, `wattage`  
5. **`display`** → split into: `screen_size`, `resolution_width`, `resolution_height`, `refresh_rate`  
6. **`camera`** → split into: `rear_camera`, `front_camera`  
7. **`card`** → split into: `card_supported`, `max_expandable_storage`  
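
A minimal sketch of the `sim` split from item 1, assuming the features appear as comma-separated tokens such as `5G`, `NFC`, and `IR Blaster`. The helper name `extract_sim_flags` is hypothetical; the first two sample strings are copied from rows shown in this notebook.

```python
import pandas as pd

def extract_sim_flags(sim: pd.Series) -> pd.DataFrame:
    """Derive boolean connectivity flags from the free-text `sim` column."""
    # Assumption: features are comma-separated; compare lower-cased, stripped tokens
    tokens = sim.fillna("").str.split(",").apply(
        lambda parts: {p.strip().lower() for p in parts}
    )
    return pd.DataFrame(
        {
            "has_5G": tokens.apply(lambda t: "5g" in t),
            "has_NFC": tokens.apply(lambda t: "nfc" in t),
            "has_IR_Blaster": tokens.apply(lambda t: "ir blaster" in t),
        },
        index=sim.index,
    )

sample = pd.Series([
    "Dual Sim, 3G, 4G, 5G, VoLTE, Wi-Fi, NFC",
    "Dual Sim, 3G, 4G, VoLTE, Wi-Fi, IR Blaster",
    None,  # missing values end up with every flag False
])
flags = extract_sim_flags(sample)
print(flags)
```

Matching whole tokens rather than substrings avoids accidental hits inside longer words.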

---

 - Some of the phones are not smartphones but feature phones; we are removing those rows.

🔍 *These issues were identified both manually and programmatically.*
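
As a rough illustration of the programmatic side of this assessment, the sketch below counts missing values per column and flags price strings that contain non-digit characters. The helper names and the toy values are made up for the example and are not part of the notebook.

```python
import pandas as pd

def assess_quality(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column missing-value counts and percentages."""
    missing = frame.isna().sum()
    return pd.DataFrame({
        "missing": missing,
        "missing_pct": (missing / len(frame) * 100).round(1),
    })

def has_non_digits(series: pd.Series) -> pd.Series:
    """True where a non-missing value contains anything other than digits."""
    return series.str.contains(r"\D", regex=True, na=False)

# Toy frame with made-up values, only to show the checks
toy = pd.DataFrame({
    "price": ["₹15,499", "9990", "₹99"],
    "rating": [77.0, None, None],
})
report = assess_quality(toy)
flagged = has_non_digits(toy["price"])
print(report)
print(flagged.tolist())
```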


In [None]:
pd.set_option("display.max_rows", None)   # show all rows

In [None]:
clean_df = pd.DataFrame()

#### As there are about 1000 rows, it is difficult to deal with them all at once, so we will take a batch of 300 rows and then the next batch

This additional info is found inside the `model` column:
* Parentheses:- ( )

RAM + ROM combinations:-
 - 8GB RAM + 256GB
 - 6GB RAM + 128GB
 - 3GB RAM + 6 B
 - B RAM + 128GB
 - 12GB RAM + 1TB
 - 128GB, 256GB, 512GB, 1TB
 - 6 B + 128GB (or any variant of messy “B RAM + X B/GB/TB”)

* Extra spaces:- Multiple consecutive spaces after replacements

In [None]:
df = flipkart.copy()

# First we deal with CONSISTENCY issue

### `brand` column

In [None]:
#---- Some models carry extra info inside (); after careful examination, it is not useful for model_name or brand_name
mask = df['model'].str.contains(
    r"\((?!.*(?:GB|TB|20[1-2][0-9])).*?\)",
    regex=True,
    na=False  # treat missing model names as non-matches so the mask stays boolean
)

df_filtered = df.loc[mask, 'model']
print(df_filtered)


754    Apple iPod Touch (7th Gen)
829         Infinix Note 12 (G96)
Name: model, dtype: object


In [None]:
def extract_brand_and_model(series):
    """
    Extract brand_name (first word) and model_name (rest of string) from a pandas Series.

    Cleaning done:
    - Remove parentheses containing GB, TB, or years (2010–2029)
    - Remove 4G/5G from model names
    - Strip extra spaces
    Returns two Series: brand_name, model_name
    """

    # --- Extract brand (first word) ---
    brand_name = series.str.split().str[0].str.strip().str.title()

    # --- Extract model name (remove brand and clean patterns) ---
    temp = series.str.split(' ', n=1).str[1].str.strip()  # remove brand

    # Remove parentheses containing GB/TB/year
    temp = temp.str.replace(r"\((?:[^)]*(?:GB|TB|20[1-2][0-9])[^)]*)\)", "", regex=True).str.strip()

    # Remove 4G or 5G
    temp = temp.str.replace(r"\s*(?:4G|5G)\b", "", regex=True).str.strip()

    # Validation: check if any extra patterns still exist
    gb_tb_year_pattern = r"\((?:[^)]*(?:GB|TB|20[1-2][0-9])[^)]*)\)"
    if temp.str.contains(gb_tb_year_pattern, regex=True).any() or temp.str.contains(r"\b(?:4G|5G)\b", regex=True).any():
        raise ValueError("There are still extra patterns left in the series!")

    return brand_name, temp


# --- Usage Example ---
brand_name, model_name = extract_brand_and_model(df['model'])

# Value counts for brands
# print(brand_name.value_counts())

# Check if any parentheses left in model_name
print(model_name[model_name.str.contains(r"\([^)]*\)", regex=True)])

754    iPod Touch (7th Gen)
829           Note 12 (G96)
Name: model, dtype: object


In [None]:
model_name.count()

np.int64(1020)

In [None]:
#---- Inserting column back into the df ----
df.insert(0,'brand_name',brand_name)
df.insert(1,'model_name', model_name)

## `os` column

After running this function some issues remain: a few rows still hold unusable `os` values, and the correct value is not present anywhere else in the row, so it cannot be recovered directly. Such values can be replaced later using some statistical rules.

In [None]:
def clean_extract_os(df, os_col='os', card_col='card'):
    """
    Clean and extract OS name and version from a dataframe.

    Steps:
    1. Identify rows where the os column has incorrect values (like No, Bluetooth, Memory, Browser, 0.3, 1.3)
       and the card column contains valid OS info (Android, OS, iOS, HarmonyOS, Nucleus)
    2. Copy valid OS values from card to os at those locations
    3. Standardize 'HarmonyOS' to 'Harmony'
    4. Split os into os_name and os_version
    5. Insert os_name and os_version columns into the dataframe
    """

    # --- Identify rows with invalid os values but valid card info ---
    invalid_os_pattern = r'No|Bluetooth|Memory|Browser|0\.3|1\.3'
    valid_card_pattern = r'Android|OS|iOS|HarmonyOS|Nucleus'

    temp_df = df[
        df[card_col].str.contains(valid_card_pattern, na=False, case=False) &
        df[os_col].str.contains(invalid_os_pattern, regex=True, na=False, case=False)
    ]

    # --- Update os values from card column where needed ---
    df.loc[temp_df.index, os_col] = temp_df[card_col].str.strip().values

    # --- Standardize HarmonyOS ---
    df[os_col] = df[os_col].str.replace('HarmonyOS', 'Harmony', case=False, regex=True)

    # --- Split into os_name and os_version ---
    os_name = df[os_col].str.split(' ').str[0]
    os_version = df[os_col].str.split(' ').str[1].str.replace('v','', case=False)

    # --- Insert new columns ---
    df.insert(df.columns.get_loc(os_col)+1, 'os_name', os_name)
    df.insert(df.columns.get_loc('os_name')+1, 'os_version', os_version)

    return df

# --- Usage ---
df = clean_extract_os(df)
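
One possible statistical rule for the values that cannot be recovered is to fill a missing `os_name` with the most frequent value for the same brand. This is only a sketch under that assumption: the helper `fill_with_group_mode` is hypothetical and the toy rows are made up.

```python
import pandas as pd

def fill_with_group_mode(frame: pd.DataFrame, target: str, by: str) -> pd.Series:
    """Fill missing `target` values with the most frequent value in the same `by` group."""
    # Mode per group, computed on known values only
    known = frame.dropna(subset=[target])
    mode_per_group = known.groupby(by)[target].agg(lambda s: s.mode().iloc[0])
    # Groups with no known value are absent from the mapping and stay missing
    return frame[target].fillna(frame[by].map(mode_per_group))

toy = pd.DataFrame({
    "brand_name": ["Samsung", "Samsung", "Samsung", "Apple", "Apple", "Namotel"],
    "os_name": ["Android", "Android", None, "iOS", None, None],
})
filled = fill_with_group_mode(toy, target="os_name", by="brand_name")
print(filled.tolist())
```

A brand with no known OS at all (the last toy row) is left missing instead of being guessed.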



## `price` column: the only non-numeric character in this column is ','. We remove it and convert the column to float.

In [None]:
def clean_price(df, price_col='price'):
    """
    Clean a price column by:
    1. Removing any non-numeric characters
    2. Converting to float
    3. Checking that all values are numeric
    """
    # Remove non-numeric characters
    cleaned = df[price_col].astype(str).str.replace(r'[^\d.]', '', regex=True)

    # Check if all values are numbers
    if not cleaned.str.replace('.', '', 1).str.isdigit().all():
        raise ValueError(f"Some values in '{price_col}' are not numeric after cleaning!")

    # Convert to float
    df[price_col] = cleaned.astype(float)

    return df

# --- Usage ---
df = clean_price(df)
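
Once the prices are numeric, the outlier noted in the assessment (₹99) can be flagged programmatically. The sketch below applies the common 1.5 × IQR rule on log10 prices. Using the log scale is an assumption made here because phone prices are strongly right-skewed, and `flag_price_outliers` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

def flag_price_outliers(price: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag prices lying more than k * IQR outside the quartiles of log10(price)."""
    # Non-positive prices cannot be log-transformed; they become NaN and are not flagged
    log_price = np.log10(price.where(price > 0))
    q1, q3 = log_price.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (log_price < q1 - k * iqr) | (log_price > q3 + k * iqr)

# Prices taken from the five rows shown in this notebook's df.sample(5) output
prices = pd.Series([15499.0, 9990.0, 2999.0, 88999.0, 99.0])
outliers = flag_price_outliers(prices)
print(prices[outliers])
```

On these five values only the ₹99 entry is flagged; on the raw (non-log) scale the lower fence would be negative and nothing would be flagged.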



## `processor` column: a lot of info needs to be extracted, such as the processor name, number of cores, and CPU frequency. Some of this column's info is also in an adjacent column and needs to be moved back.

### Processor info has leaked into the `ram` column at index 532 and 611


In [None]:
df['ram'][df['ram'].str.contains('MHz')]

Unnamed: 0,ram
532,"Single Core, 208 MHz Processor"
611,"Dual Core, 500 MHz Processor"


### Putting that info back into the processor column

In [None]:
index = df['ram'].str.contains('MHz')
df.loc[index, 'processor'] = df.loc[index, 'ram']

### Checking if processor info is in `sim` column, but none is found

In [None]:
df['sim'][df['sim'].str.contains('Core|Processor')]

Unnamed: 0,sim


In [None]:
def extract_processor_name(series):
    """
    Clean a pandas Series of processor names:
    - Normalize spaces and title case
    - Remove unwanted prefixes/suffixes (Samsung, 5G)
    - Apply mapping for typos and standardization
    - Handle special processor naming cases
    """

    # --- Initial normalization ---
    series = series.str.split(',').str[0]  # take first part before comma
    series = series.str.strip().str.title()
    series = series.str.replace(r'\s+', ' ', regex=True)

    # --- Remove "5G" suffix ---
    series = series.str.replace(r'\b5G\b', '', regex=True, flags=re.IGNORECASE).str.strip()

    # --- Remove "Samsung" prefix ---
    series = series.str.replace(r'^Samsung\s+', '', regex=True, flags=re.IGNORECASE).str.strip()

    # --- Mapping replacements for common typos / standardization ---
    # Series.replace with a dict only matches whole values, so the
    # 'Sanpdragon' typo is fixed separately as a substring replacement.
    series = series.str.replace(r'\bSanpdragon\b', 'Snapdragon', regex=True)
    mapping = {
        "Snapdragon 778G+": "Snapdragon 778G Plus",
        "Snapdragon Qm215": "Snapdragon QM215",
        "Snapdragon 8+ Gen 1": "Snapdragon 8+ Gen1",
    }
    series = series.replace(mapping)

    # --- Special processor fixes ---
    special_cases = {
        r'(Apple\s+)?A13(\s+Bionic)?': "Bionic A13",
        r'Sc6531E': "Unisoc Sc6531E",
        r'Dimensity\s+8100-Max': "Dimensity 8100 Max",
        r'Snapdragon\s+Qm215': "Snapdragon QM215",
        r'Samsung\s+Exynos\s+7885': "Exynos 7885",
        r'Sc9863A': "Unisoc Sc9863A",
    }

    for pattern, replacement in special_cases.items():
        series = series.str.replace(f'^{pattern}$', replacement, regex=True, flags=re.IGNORECASE)

    # --- Replace '-' with space ---
    series = series.str.replace('-', ' ')

    # Optional: check for NaN
    if series.isna().any():
        print("Warning: Some processor names are still NaN after cleaning!")

    return series



#### Taking out the core values from the processor column

In [None]:
def extract_cores(series):
    """
    Extracts the number of CPU cores from a pandas Series based on keywords.
    Recognizes 'Octa', 'Hexa', 'Quad', and 'Dual' as 8, 6, 4, and 2 cores respectively.
    Returns NaN if no keyword is found.
    """
    # Use the Series that was passed in rather than the global df;
    # na=False keeps every condition a strict boolean array, as np.select requires.
    number_of_cores = np.select(
        [
            series.str.contains("Octa", na=False),
            series.str.contains("Hexa", na=False),
            series.str.contains("Quad", na=False),
            series.str.contains("Dual", na=False),
        ],
        [8, 6, 4, 2],
        default=np.nan,
    )
    return number_of_cores



#### Checking for the frequency in the processor column. It contains speeds in both GHz and MHz, so we convert everything into one unit, GHz, using the formula GHz = MHz / 1000

In [None]:
def extracting_frequency(series):
    """
    Extracts processor frequencies (GHz or MHz) from a pandas Series.
    Converts MHz values to GHz and returns a NumPy array of frequencies.
    Rows without a match become NaN.
    """

    pattern = r'([\d\.]+)\s*([GM]Hz)'
    # Extract numeric value and unit (GHz or MHz) using regex
    freq_data = series.str.extract(pattern)

    # Convert to float
    freq_values = freq_data[0].astype(float)

    # Convert MHz to GHz using vectorized np.where
    freq_values = np.where(freq_data[1] == 'MHz', freq_values / 1000, freq_values)

    return freq_values




In [None]:
#----- CALLING SEPARATE FUNCTIONS TO CLEAN & EXTRACT INFO FROM PROCESSOR COLUMN ----
processor_name = extract_processor_name(df['processor'])
number_of_cores = extract_cores(df['processor'])
speed_of_processor = extracting_frequency(df['processor'])

In [None]:
#---- PUTTING INFO BACK INTO COLUMN ----
df.insert(7,'processor_name', processor_name)
df.insert(8,'number_of_cores', number_of_cores)
df.insert(9,'speed_of_processor(GHz)', speed_of_processor)

In [None]:
df.sample(5)

Unnamed: 0,brand_name,model_name,model,price,rating,sim,processor,processor_name,number_of_cores,speed_of_processor(GHz),ram,battery,display,camera,card,os,os_name,os_version
509,Realme,Narzo 50,Realme Narzo 50 (6GB RAM + 128GB),15499.0,77.0,"Dual Sim, 3G, 4G, VoLTE, Wi-Fi","Helio G96, Octa Core, 2 GHz Processor",Helio G96,8.0,2.0,"6 GB RAM, 128 GB inbuilt",5000 mAh Battery with 33W Fast Charging,"6.6 inches, 1080 x 2412 px, 120 Hz Display wit...",50 MP + 2 MP + 2 MP Triple Rear & 16 MP Front ...,"Memory Card Supported, upto 256 GB",Android v11,Android,11
598,Xiaomi,Redmi Note 8 2021,Xiaomi Redmi Note 8 2021,9990.0,75.0,"Dual Sim, 3G, 4G, VoLTE, Wi-Fi, IR Blaster","Helio G85, Octa Core, 2 GHz Processor",Helio G85,8.0,2.0,"4 GB RAM, 64 GB inbuilt",4000 mAh Battery with 18W Fast Charging,"6.3 inches, 1080 x 2340 px Display with Water ...",48 MP Quad Rear & 13 MP Front Camera,Memory Card Supported,Android v11,Android,11
400,Jio,JioPhone 2,Jio JioPhone 2,2999.0,,"Dual Sim, 3G, 4G, VoLTE, Wi-Fi","Dual Core, 1 GHz Processor",Dual Core,2.0,1.0,"512 MB RAM, 4 GB inbuilt",2000 mAh Battery,"2.4 inches, 320 x 240 px Display",2 MP Rear & 0.3 MP Front Camera,"Memory Card Supported, upto 128 GB",KAI OS,KAI,OS
774,Samsung,Galaxy S22 Plus,Samsung Galaxy S22 Plus 5G (8GB RAM + 256GB),88999.0,88.0,"Dual Sim, 3G, 4G, 5G, VoLTE, Wi-Fi, NFC","Snapdragon 8 Gen1, Octa Core, 3 GHz Processor",Snapdragon 8 Gen1,8.0,3.0,"8 GB RAM, 256 GB inbuilt",4500 mAh Battery with 45W Fast Charging,"6.6 inches, 1080 x 2340 px, 120 Hz Display wit...",50 MP + 12 MP + 10 MP Triple Rear & 10 MP Fron...,Android v12,Android v12,Android,12
608,Namotel,Achhe Din,Namotel Achhe Din,99.0,,"Dual Sim, 3G, Wi-Fi","1 GB RAM, 4 GB inbuilt",1 Gb Ram,,,1325 mAh Battery,"4 inches, 720 x 1280 px Display",2 MP Rear & 0.3 MP Front Camera,Android v5.0 (Lollipop),Bluetooth,,,


### `ram` column: This column contains memory values in both GB and MB, along with some extraneous or non-standard entries. It includes information for both RAM and ROM, which need to be separated.

### Some data of the `ram` column is in the `battery` column; bringing it back into the `ram` column

In [None]:
df['battery'][df['battery'].str.contains('GB|MB')]

Unnamed: 0,battery
376,"48 MB RAM, 128 MB inbuilt"
551,"64 MB RAM, 128 MB inbuilt"
582,"48 MB RAM, 128 MB inbuilt"
611,"32 MB RAM, 32 MB inbuilt"
817,"48 MB RAM, 128 MB inbuilt"
882,"48 MB RAM, 128 MB inbuilt"
1000,"32 MB RAM, 32 MB inbuilt"


In [None]:
def extract_memory(df):
    """
    Fix misplaced RAM values from the 'battery' column
    and extract RAM and ROM (in GB) from the 'ram' column.

    Modifies df['ram'] in place. Values given in MB or TB do not match
    the GB-only patterns below and come back as NaN.
    Returns two Series: ram, rom
    """
    # ---- Step 1: Bring misplaced values back from 'battery' to 'ram'
    mask = df['battery'].str.contains('GB|MB', na=False)
    df.loc[mask, 'ram'] = df.loc[mask, 'battery']

    # ---- Step 2: Extract RAM (the number before 'GB RAM')
    ram = df['ram'].str.extract(r'(\d+)\s*GB\s*RAM')[0].astype(float)

    # ---- Step 3: Extract ROM (the number before 'GB inbuilt')
    rom = df['ram'].str.extract(r'(\d+)\s*GB\s*inbuilt')[0].astype(float)

    return ram, rom


# Usage: pass the whole DataFrame, since both 'battery' and 'ram' are needed
ram, rom = extract_memory(df)
df.insert(10,'ram_extracted', ram)
df.insert(11,'rom_extracted', rom)
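
Since the column mixes MB and GB values, one way to keep the MB rows is to capture the unit together with the number and convert everything to GB. The sketch below assumes binary units (1 GB = 1024 MB, matching the TB conversion used for memory cards later in this notebook); `memory_in_gb` is a hypothetical helper and the sample strings are copied from rows shown above.

```python
import re

import pandas as pd

# Assumption: binary units, so 1 GB = 1024 MB and 1 TB = 1024 GB
UNIT_TO_GB = {"MB": 1 / 1024, "GB": 1.0, "TB": 1024.0}

def memory_in_gb(text: pd.Series, label: str) -> pd.Series:
    """Extract '<number> <unit> <label>' (label is 'RAM' or 'inbuilt') and convert to GB."""
    pattern = rf"(\d+(?:\.\d+)?)\s*(MB|GB|TB)\s*{re.escape(label)}"
    parts = text.str.extract(pattern, flags=re.IGNORECASE)
    value = parts[0].astype(float)
    factor = parts[1].str.upper().map(UNIT_TO_GB)
    return value * factor  # NaN where the pattern did not match

sample = pd.Series([
    "6 GB RAM, 128 GB inbuilt",
    "512 MB RAM, 4 GB inbuilt",
    "48 MB RAM, 128 MB inbuilt",
])
ram_gb = memory_in_gb(sample, "RAM")
rom_gb = memory_in_gb(sample, "inbuilt")
print(ram_gb.tolist(), rom_gb.tolist())
```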

In [None]:
df.columns

Index(['brand_name', 'model_name', 'model', 'price', 'rating', 'sim',
       'processor', 'processor_name', 'number_of_cores',
       'speed_of_processor(GHz)', 'ram_extracted', 'rom_extracted', 'ram',
       'battery', 'display', 'camera', 'card', 'os', 'os_name', 'os_version'],
      dtype='object')

### `battery` column: this contains three pieces of info: battery capacity, charging wattage, and fast-charging support

In [None]:
df['battery'].str.contains('Battery').sum()

np.int64(987)

In [None]:
#---- Checking whether battery info accidentally ended up in an adjacent column, i.e. display or ram ----
#---- The battery column has the word 'Battery' in nearly all rows, so we look for that word in the neighbouring columns ----
index_battery = df['display'][df['display'].str.contains('Battery', na=False)].index
df.loc[index_battery,'battery'] = df.loc[index_battery,'display']

index_ram = df['ram'][df['ram'].str.contains('Battery', na=False)].index
df.loc[index_ram,'battery'] = df.loc[index_ram,'ram']
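
The same "look for a keyword in the neighbouring columns and copy it back" step is repeated for several columns in this notebook. A small helper could express it once; the name `pull_misplaced` and the toy frame below are hypothetical, and like the cells here it modifies the frame in place.

```python
import pandas as pd

def pull_misplaced(frame: pd.DataFrame, target: str, sources: list, keyword: str) -> int:
    """Copy values into `target` from any `sources` column whose text contains `keyword`.

    Returns the number of cells copied.
    """
    moved = 0
    for source in sources:
        # regex=False: treat the keyword as plain text; na=False: skip missing cells
        mask = frame[source].str.contains(keyword, na=False, regex=False)
        frame.loc[mask, target] = frame.loc[mask, source]
        moved += int(mask.sum())
    return moved

# Toy frame: the second row mimics a record whose values are shifted one column left
toy = pd.DataFrame({
    "ram": ["6 GB RAM, 128 GB inbuilt", "1325 mAh Battery"],
    "battery": ["5000 mAh Battery with 33W Fast Charging", "4 inches, 720 x 1280 px Display"],
    "display": ["6.6 inches, 1080 x 2412 px", None],
})
moved = pull_misplaced(toy, target="battery", sources=["display", "ram"], keyword="Battery")
print(moved, toy.loc[1, "battery"])
```

Because each copy overwrites the target cell, the order in which columns are rescued matters: here the display text that sat in `battery` is lost unless it is copied out first.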


In [None]:
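def extract_battery(series):
    """
    Split the battery description into capacity (mAh), charging wattage (W),
    and a Yes/No fast-charging flag.
    """
    # Anchor on the 'mAh' unit so rows holding unrelated text yield NaN
    battery_capacity = series.str.extract(r'(\d+)\s*mAh')[0].astype(float)
    # Allow decimal wattages (e.g. 22.5W) instead of capturing only the digits after the point
    wattage = series.str.extract(r'(\d+(?:\.\d+)?)\s*W\b')[0].astype(float)
    # na=False so missing descriptions are reported as 'No' rather than 'Yes'
    fast_charging = np.where(series.str.contains('Fast Charging', na=False), 'Yes', 'No')

    return battery_capacity, wattage, fast_charging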

battery_capacity, wattage, fast_charging = extract_battery(df['battery'])

df.insert(13,'battery_capacity', battery_capacity)
df.insert(14,'wattage', wattage)
df.insert(15,'fast_charging', fast_charging)




### `display` column: this contains four pieces of info: screen size, resolution in pixels, refresh rate, and notch type (punch hole or water drop)


In [None]:
df['display'].str.contains('inches', na=False).sum()

np.int64(987)

In [None]:
index= df['camera'][df['camera'].str.contains('inches', na=False)].index
df.loc[index,'display'] = df.loc[index,'camera']

index = df['battery'][df['battery'].str.contains('inches', na=False)].index
df.loc[index,'display'] = df.loc[index,'battery']


In [None]:
df['display'].str.contains('inches', na=False).sum()

np.int64(1005)

In [None]:
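def extract_display_info(series):
    """
    Extract screen size (inches), resolution, refresh rate (Hz),
    and notch design from the display description.
    """
    # Anchor on the 'inches' unit so rows without a screen size yield NaN
    screen_size = series.str.extract(r'(\d+(?:\.\d+)?)\s*inches')[0].astype(float)
    resolution = series.str.extract(r'(\d+\s*[x×]\s*\d+)\s*px')[0]
    refresh_rate = series.str.extract(r'(\d+)\s*Hz')[0].astype(float)
    design = series.str.extract(r'(Punch Hole|Water Drop)', flags=re.IGNORECASE)[0]

    return screen_size, resolution, refresh_rate, design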

screen_size,resolution, refresh_rate, design = extract_display_info(df['display'])

df.insert(17,'screen_size', screen_size)
df.insert(18,'resolution', resolution)
df.insert(19,'refresh_rate', refresh_rate)
df.insert(20,'design', design)
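
The tidiness list asks for `resolution_width` and `resolution_height`, while the cell above keeps the resolution as a single string. A possible follow-up split is sketched below, assuming the first number is the width and the second the height; `split_resolution` is a hypothetical helper.

```python
import pandas as pd

def split_resolution(resolution: pd.Series) -> pd.DataFrame:
    """Split strings like '1080 x 2412' into numeric width and height columns."""
    parts = resolution.str.extract(r"(\d+)\s*[x×]\s*(\d+)")
    parts.columns = ["resolution_width", "resolution_height"]
    return parts.astype(float)  # unmatched or missing rows stay NaN

sample = pd.Series(["1080 x 2412", "320 x 240", None])
dims = split_resolution(sample)
print(dims)
```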

In [None]:
df.columns

Index(['brand_name', 'model_name', 'model', 'price', 'rating', 'sim',
       'processor', 'processor_name', 'number_of_cores',
       'speed_of_processor(GHz)', 'ram_extracted', 'rom_extracted', 'ram',
       'battery_capacity', 'wattage', 'fast_charging', 'battery',
       'screen_size', 'resolution', 'refresh_rate', 'design', 'display',
       'camera', 'card', 'os', 'os_name', 'os_version'],
      dtype='object')

In [None]:
df['design']

Unnamed: 0,design
0,Punch Hole
1,Punch Hole
2,Water Drop
3,Punch Hole
4,Punch Hole
5,Water Drop
6,
7,Punch Hole
8,Punch Hole
9,Punch Hole


### `camera` column: This column contains information about both the rear and front cameras, including details on the number of cameras (e.g., dual, triple, quad) and their respective specifications.

In [None]:
df['camera'].str.contains('MP').sum()

954

In [None]:
index = df['card'][df['card'].str.contains('MP',na=False)].index
df.loc[index,'camera'] = df.loc[index,'card']

index = df['display'][df['display'].str.contains('MP',na=False)].index
df.loc[index,'camera'] = df.loc[index,'display']

In [None]:
def extract_camera_info(series):
    """
    Extract camera type (Dual/Triple/Quad), rear camera specs, and front camera specs
    from a pandas Series containing camera description.

    Parameters
    ----------
    series : pd.Series
        Text column containing camera info (e.g., '64 MP Quad Rear & 16 MP Front Camera').

    Returns
    -------
    camera_type : pd.Series
        Rear camera type as string ('Dual', 'Triple', 'Quad'); NaN when no keyword is present.
    rear_camera : pd.Series
        Rear camera megapixels as string (e.g., '64+2+2').
    front_camera : pd.Series
        Front camera megapixels as string (e.g., '16').
    """
    # One or more '<number> MP' entries joined by '+'; decimals such as 0.3 are allowed
    mp_list = r'((?:\d+(?:\.\d+)?\s*MP\s*\+?\s*)+)'

    # Extract rear camera numbers (the Dual/Triple/Quad keyword is optional,
    # so single rear cameras are captured too)
    rear_camera = series.str.extract(mp_list + r'(?:Dual|Triple|Quad)?\s*Rear')[0]
    rear_camera = rear_camera.str.replace(r'\s*MP', '', regex=True).str.replace(r'\s+', '', regex=True)

    # Extract camera type (Dual, Triple, Quad)
    camera_type = series.str.extract(r'(Dual|Triple|Quad)\s*Rear')[0]

    # Extract front camera numbers
    front_camera = series.str.extract(mp_list + r'(?:Dual)?\s*Front')[0]
    front_camera = front_camera.str.replace(r'\s*MP', '', regex=True).str.replace(r'\s+', '', regex=True)

    return camera_type, rear_camera, front_camera

camera_type, rear_camera, front_camera = extract_camera_info(df['camera'])

#---- splitting the rear camera string into up to three individual lenses ----
rear_camera1 = rear_camera.str.split('+').str[0]
rear_camera2 = rear_camera.str.split('+').str[1]
rear_camera3 = rear_camera.str.split('+').str[2]

In [None]:
df.insert(22,'camera_type', camera_type)
df.insert(23,'rear_camera1', rear_camera1)
df.insert(24,'rear_camera2', rear_camera2)
df.insert(25,'rear_camera3', rear_camera3)
df.insert(26,'front_camera', front_camera)

In [None]:
df.columns

Index(['brand_name', 'model_name', 'model', 'price', 'rating', 'sim',
       'processor', 'processor_name', 'number_of_cores',
       'speed_of_processor(GHz)', 'ram_extracted', 'rom_extracted', 'ram',
       'battery_capacity', 'wattage', 'fast_charging', 'battery',
       'screen_size', 'resolution', 'refresh_rate', 'design', 'display',
       'camera_type', 'rear_camera1', 'rear_camera2', 'rear_camera3',
       'front_camera', 'camera', 'card', 'os', 'os_name', 'os_version'],
      dtype='object')

### `card` column: contains info on whether the phone supports a memory card, and what kind of memory card slot it has.

In [None]:
df['card'].str.contains('Memory').sum()

778

In [None]:
index = df['os'][df['os'].str.contains('Memory', na=False)].index
df.loc[index,'card'] = df.loc[index,'os']

index = df['camera'][df['camera'].str.contains('Memory', na=False)].index
df.loc[index,'card'] = df.loc[index,'camera']

In [None]:
df['card'][df['card'].str.contains('Not', na=False)].count()


np.int64(129)

In [None]:
def extract_card(series):
    """
    Extract memory card support info:
    - Whether card is supported or not
    - Whether it is hybrid
    - Capacity in GB (standardized, 0 if not supported)
    """
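    # Card supported only when the text is a memory-card description without 'Not';
    # missing values and rows holding misplaced OS/camera text are treated as 'No'
    is_card_text = series.str.contains('Memory Card', na=False)
    card_supported = np.where(is_card_text & ~series.str.contains('Not', na=False), 'Yes', 'No')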

    # Hybrid slot
    card_hybrid = np.where(series.str.contains('Hybrid', na=False), 'Yes', 'No')

    # Extract number + unit (GB or TB)
    capacity = series.str.extract(r'(\d+)\s*(GB|TB)', expand=True)

    # Convert numeric part to float
    values = capacity[0].astype(float)

    # Standardize: convert TB → GB, leave GB unchanged
    memory_card_capacity = np.where(capacity[1] == 'TB', values * 1024, values)

    # Force 0 capacity if card is not supported
    memory_card_capacity = np.where(card_supported == 'No', 0, memory_card_capacity)

    return card_supported, card_hybrid,memory_card_capacity

df['card_supported'], df['card_hybrid'], df['memory_card_capacity'] = extract_card(df['card'])






In [None]:
df.columns

Index(['brand_name', 'model_name', 'model', 'price', 'rating', 'sim',
       'processor', 'processor_name', 'number_of_cores',
       'speed_of_processor(GHz)', 'ram_extracted', 'rom_extracted', 'ram',
       'battery_capacity', 'wattage', 'fast_charging', 'battery',
       'screen_size', 'resolution', 'refresh_rate', 'design', 'display',
       'camera_type', 'rear_camera1', 'rear_camera2', 'rear_camera3',
       'front_camera', 'camera', 'card', 'os', 'os_name', 'os_version',
       'card_supported', 'card_hybrid', 'memory_card_capacity'],
      dtype='object')

In [131]:
df = df.drop(['model', 'rating', 'sim', 'processor', 'ram', 'battery',
       'display', 'camera', 'card', 'os'], axis=1)

In [132]:
df.to_csv("output.csv", index=False, encoding="utf-8")