## Exploratory Data Analysis and Linear Models
This notebook documents how I am getting to know the scikit-learn library for machine learning. My goals are to:
- Develop machine learning skills while exploring scikit-learn.
- Create my own repository containing an overview of what can be done with scikit-learn.
- Note the key points that made me think out loud and surprised me!
- Recall and hone my skills in pandas, NumPy, SciPy, seaborn, matplotlib, and more.

## Importing Libraries and Packages
We will use these packages to manipulate the data, visualize the features, and measure how well our model performs.

In [1]:
import sys
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

sys.path.append('../scripts')
import utilities as ut

sns.set_style("whitegrid")
%matplotlib inline

import os 
print(os.listdir("../data"))

['laptops_train.csv']


## Loading and Viewing Data Set
Before we begin, let's look at the data table to see the values we'll be working with. We can use the head and describe methods to look at sample rows and summary statistics. We can also look at its keys, which are the column names.

In [2]:
laptops_data = pd.read_csv("../data/laptops_train.csv")

laptops_data.head()

Unnamed: 0,Manufacturer,Model Name,Category,Screen Size,Screen,CPU,RAM,Storage,GPU,Operating System,Operating System Version,Weight,Price
0,Apple,MacBook Pro,Ultrabook,"13.3""",IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,,1.37kg,11912523.48
1,Apple,Macbook Air,Ultrabook,"13.3""",1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,,1.34kg,7993374.48
2,HP,250 G6,Notebook,"15.6""",Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,,1.86kg,5112900.0
3,Apple,MacBook Pro,Ultrabook,"15.4""",IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,,1.83kg,22563005.4
4,Apple,MacBook Pro,Ultrabook,"13.3""",IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,,1.37kg,16037611.2
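
The cell above only calls head. As a minimal sketch of the describe and column-name checks mentioned earlier, the snippet below uses a small toy frame (`sample`, a hypothetical stand-in with a few values copied from the preview) rather than the full CSV.

```python
import pandas as pd

# Toy frame standing in for laptops_data (an assumption for illustration;
# the values are copied from the preview above).
sample = pd.DataFrame({
    "Manufacturer": ["Apple", "Apple", "HP"],
    "RAM": ["8GB", "8GB", "8GB"],
    "Price": [11912523.48, 7993374.48, 5112900.0],
})

# By default describe() summarises numeric columns only.
summary = sample.describe()

# include="all" adds count/unique/top/freq for the text columns as well.
full_summary = sample.describe(include="all")

# The column names as a plain Python list.
column_names = list(sample.columns)

print(summary)
print(full_summary)
print(column_names)
```

On the real data the same calls would be made on `laptops_data`. Only Price is numeric at this stage, so the default describe() reports a single column.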


Before fitting scikit-learn linear models to this dataset, we start with exploratory data analysis. We will perform the following operations on the dataset:
- Data Preprocessing
- Segmenting data into smaller unique values
- Handling Datatypes
- Data Quality

In [3]:
#check null values and data types in all columns
laptops_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 977 entries, 0 to 976
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Manufacturer              977 non-null    object 
 1   Model Name                977 non-null    object 
 2   Category                  977 non-null    object 
 3   Screen Size               977 non-null    object 
 4   Screen                    977 non-null    object 
 5   CPU                       977 non-null    object 
 6   RAM                       977 non-null    object 
 7   Storage                   977 non-null    object 
 8   GPU                       977 non-null    object 
 9   Operating System          977 non-null    object 
 10  Operating System Version  841 non-null    object 
 11  Weight                    977 non-null    object 
 12  Price                     977 non-null    float64
dtypes: float64(1), object(12)
memory usage: 99.4+ KB
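
The info output shows 841 non-null entries out of 977 for Operating System Version, so 136 values are missing there. A minimal sketch of getting such counts directly is shown below; the toy frame `sample` is an assumption for illustration, not the real data.

```python
import pandas as pd

# Toy frame with two missing versions (hypothetical values).
sample = pd.DataFrame({
    "Operating System": ["macOS", "No OS", "Windows"],
    "Operating System Version": [None, None, "10"],
})

# Number of missing values per column.
null_counts = sample.isna().sum()

# Fraction of missing values per column.
null_share = sample.isna().mean()

print(null_counts)
print(null_share)
```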


Columns such as Screen Size, CPU, RAM, Storage, and Weight store their units alongside the values. We will strip the units so that these columns can later be given more suitable data types. First, we normalize the column names to lowercase with underscores.

In [4]:
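# regex=False treats '(' and ')' as literal characters rather than regex syntax
laptops_data.columns = laptops_data.columns.str.strip().str.lower().str.replace(' ', '_', regex=False).str.replace('(', '', regex=False).str.replace(')', '', regex=False)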

laptops_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 977 entries, 0 to 976
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   manufacturer              977 non-null    object 
 1   model_name                977 non-null    object 
 2   category                  977 non-null    object 
 3   screen_size               977 non-null    object 
 4   screen                    977 non-null    object 
 5   cpu                       977 non-null    object 
 6   ram                       977 non-null    object 
 7   storage                   977 non-null    object 
 8   gpu                       977 non-null    object 
 9   operating_system          977 non-null    object 
 10  operating_system_version  841 non-null    object 
 11  weight                    977 non-null    object 
 12  price                     977 non-null    float64
dtypes: float64(1), object(12)
memory usage: 99.4+ KB


In [5]:
print(dir(ut))

['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'convert_cpu', 'get_gpu', 'get_storage_type', 'np', 'os', 'pd']
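
The source of the utilities module is not shown in this notebook. Judging only from the before and after previews here (for example 'Intel Core i5 2.3GHz' becoming 'i5'), hypothetical stand-ins for the three helpers might look like the sketch below. The `_sketch` names and the exact parsing rules are assumptions, not the actual implementation.

```python
import re


def convert_cpu_sketch(cpu: str) -> str:
    """Return the Intel Core tier (e.g. 'i5') if present, else the first word."""
    match = re.search(r"\bi[3579]\b", cpu)
    if match:
        return match.group(0)
    words = cpu.split()
    return words[0] if words else ""


def get_storage_type_sketch(storage: str) -> str:
    """Drop the leading capacity token: '128GB SSD' -> 'SSD'."""
    parts = storage.split(" ", 1)
    return parts[1] if len(parts) > 1 else ""


def get_gpu_sketch(gpu: str) -> str:
    """Keep the vendor, taken to be the first word: 'AMD Radeon Pro 455' -> 'AMD'."""
    words = gpu.split()
    return words[0] if words else ""


print(convert_cpu_sketch("Intel Core i5 7200U 2.5GHz"))
print(get_storage_type_sketch("128GB Flash Storage"))
print(get_gpu_sketch("Intel Iris Plus Graphics 640"))
```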


In [6]:
# Strip the inch mark from screen_size
laptops_data['screen_size'] = laptops_data['screen_size'].str.replace('"','')

# Simplify the CPU description (e.g. 'Intel Core i5 2.3GHz' becomes 'i5')
laptops_data['cpu'] = laptops_data['cpu'].apply(ut.convert_cpu)

# Split storage into two columns: the type goes into storage_type ...
laptops_data['storage_type'] = laptops_data['storage'].apply(ut.get_storage_type)

# ... and storage keeps only the capacity token (e.g. '128GB')
laptops_data['storage'] = laptops_data['storage'].str.split(' ').str[0]

# Drop the 'GB' suffix from ram
laptops_data['ram'] = laptops_data['ram'].str.split('GB').str[0]

# Keep only the GPU vendor (e.g. 'Intel', 'AMD')
laptops_data['gpu'] = laptops_data['gpu'].apply(ut.get_gpu)

# Rename 'macOS' to 'Mac OS'
laptops_data['operating_system'] = laptops_data['operating_system'].str.replace('macOS','Mac OS')

# Drop the 'kg' suffix from weight
laptops_data['weight'] = laptops_data['weight'].str.split('kg').str[0]
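
Stripping the units leaves these columns as strings. A minimal sketch of the follow-up dtype conversion with `pd.to_numeric` is shown below on a toy frame (`sample` is an assumption for illustration). The storage column is left out because it still carries its unit, as in '128GB'.

```python
import pandas as pd

# Toy frame with unit-free strings, as produced by the cell above.
sample = pd.DataFrame({
    "screen_size": ["13.3", "15.6"],
    "ram": ["8", "16"],
    "weight": ["1.37", "1.86"],
})

numeric_columns = ["screen_size", "ram", "weight"]
for column in numeric_columns:
    # errors="coerce" turns unparseable strings into NaN instead of raising.
    sample[column] = pd.to_numeric(sample[column], errors="coerce")

print(sample.dtypes)
```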

In [7]:
# Fill missing OS versions with the most frequent value
laptops_data['operating_system_version'] = laptops_data['operating_system_version'].fillna(laptops_data['operating_system_version'].mode()[0])

# Normalize 'X' to '10'
laptops_data['operating_system_version'] = laptops_data['operating_system_version'].str.replace('X','10')

# Normalize '10 S' to '10'
laptops_data['operating_system_version'] = laptops_data['operating_system_version'].str.replace('10 S','10')
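
To make the fill-and-normalize step easier to follow, here is a minimal sketch of the same idea on a toy Series. The values in `versions` are hypothetical and chosen only to cover each case.

```python
import pandas as pd

# Toy version column with missing entries and the variants handled above.
versions = pd.Series(["10", None, "X", "10 S", None, "10"])

# mode() ignores missing values and returns the most frequent entries;
# [0] picks the first of them.
most_common = versions.mode()[0]
filled = versions.fillna(most_common)

# regex=False makes both replacements plain substring substitutions.
normalised = (
    filled.str.replace("X", "10", regex=False)
          .str.replace("10 S", "10", regex=False)
)

print(most_common)
print(normalised.tolist())
```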

In [8]:
# A regular, or even a more powerful, laptop should never be priced in crores.
# Prices should be in lakhs, so we slice the string form to keep its first seven characters.

laptops_data['price'] = laptops_data['price'].astype('str').str[:7]
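
The slice keeps the first seven characters of each price, so an eight-digit price loses its last digit while a seven-digit price is unchanged, and the column stays a string. The sketch below reproduces that on the five-row preview's prices and compares it with a single divisor applied to every row. The divisor of 10 is an assumption for illustration, not the notebook's method.

```python
import pandas as pd

# First four prices from the preview.
prices = pd.Series([11912523.48, 7993374.48, 5112900.0, 22563005.4])

# What the slice does: keep the first seven characters of the string form.
sliced = prices.astype("str").str[:7]

# The same values as numbers, for comparison.
as_numbers = pd.to_numeric(sliced)

# Hypothetical alternative: one divisor for every row keeps the ordering intact.
uniform = (prices / 10).round(2)

print(sliced.tolist())
print(as_numbers.idxmax(), uniform.idxmax())
```

After slicing, the largest value belongs to row 1, whereas in the original prices (and under the uniform rescale) it belongs to row 3.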

In [9]:
column_names = list(laptops_data.columns)
new_column_index = 8  # Position for the new column, right after storage

# Move the last column (storage_type) to that position
column_names.insert(new_column_index, column_names.pop())

# Select the columns in the new order
laptops_data = laptops_data[column_names]

# storage_type now sits right after storage
laptops_data.head()

Unnamed: 0,manufacturer,model_name,category,screen_size,screen,cpu,ram,storage,storage_type,gpu,operating_system,operating_system_version,weight,price
0,Apple,MacBook Pro,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,i5,8,128GB,SSD,Intel,Mac OS,10,1.37,1191252
1,Apple,Macbook Air,Ultrabook,13.3,1440x900,i5,8,128GB,Flash Storage,Intel,Mac OS,10,1.34,7993374
2,HP,250 G6,Notebook,15.6,Full HD 1920x1080,i5,8,256GB,SSD,Intel,No OS,10,1.86,5112900
3,Apple,MacBook Pro,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,i7,16,512GB,SSD,AMD,Mac OS,10,1.83,2256300
4,Apple,MacBook Pro,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,i5,8,256GB,SSD,Intel,Mac OS,10,1.37,1603761
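
The reordering above relies on a hard-coded position. A minimal alternative sketch looks the position up by column name and moves the column in place with `DataFrame.pop` and `DataFrame.insert`. The toy frame `sample` is an assumption for illustration.

```python
import pandas as pd

# Toy frame with storage_type appended as the last column.
sample = pd.DataFrame({
    "storage": ["128GB"],
    "gpu": ["Intel"],
    "weight": ["1.37"],
    "storage_type": ["SSD"],
})

# Position immediately after the storage column.
target = sample.columns.get_loc("storage") + 1

# pop() removes the column and returns it; insert() puts it back at target.
sample.insert(target, "storage_type", sample.pop("storage_type"))

print(list(sample.columns))
```

Looking the position up by name keeps the cell working if columns are added or removed earlier in the notebook.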
