# Data Cleaning-Smartphone

This notebook cleans the smartphone subset of a set of Flipkart reviews that also covers laptops, headphones, smart watches, DSLRs (professional cameras), printers, monitors, home theaters, and routers.
Each dataset is preprocessed separately. First we filter the data to get an equal or near-equal number of reviews for each rating. The review heading is kept alongside the review text, because the heading can also help in determining the rating.

In [70]:
#import the dataset
import pandas as pd
import numpy as np
df=pd.read_csv("Smartphone Rating.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,Rating,Heading,Review,Product
0,0,5.0,Simply awesome,Phone is awesome no problem full paisa wasool ...,Smartphone
1,1,5.0,Perfect product!,Same As Expected From Moto! This is the Best v...,Smartphone
2,2,5.0,Awesome,Good value for money phone. Best past is this ...,Smartphone
3,3,,,,Smartphone
4,4,,,,Smartphone


In [71]:
#show all columns and rows when displaying a DataFrame
pd.set_option('display.max_columns',None)
pd.set_option('display.max_rows',None)

#### Observations:
* The `Unnamed: 0` column is only a saved copy of the row index, so we can drop it.

In [72]:
df.drop(['Unnamed: 0'], axis=1, inplace=True)
df.head()

Unnamed: 0,Rating,Heading,Review,Product
0,5.0,Simply awesome,Phone is awesome no problem full paisa wasool ...,Smartphone
1,5.0,Perfect product!,Same As Expected From Moto! This is the Best v...,Smartphone
2,5.0,Awesome,Good value for money phone. Best past is this ...,Smartphone
3,,,,Smartphone
4,,,,Smartphone
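A column named `Unnamed: 0` usually appears when a DataFrame was written with `to_csv` using its default `index=True` and then read back without declaring an index column. The sketch below reproduces this on a made-up two-row frame (the data and variable names are illustrative, not taken from the dataset) and shows that `index_col=0` avoids the extra column at read time.

```python
import io
import pandas as pd

# Toy frame standing in for the review data (values are illustrative).
toy = pd.DataFrame({"Rating": [5.0, 4.0], "Review": ["Good phone", "Decent"]})

# With no path, to_csv returns the CSV text; the index is written
# as an unlabeled first column by default.
csv_text = toy.to_csv()

# Reading it back naively turns that column into "Unnamed: 0".
naive = pd.read_csv(io.StringIO(csv_text))
print(list(naive.columns))

# Declaring the first column as the index keeps it out of the features.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(clean.columns))
```

Either approach (dropping the column after reading, or using `index_col=0`) leaves the same four data columns.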


### Exploratory Data Analysis

In [73]:
#check the dimensions of the data (Smartphone)
df.shape

(20080, 4)

* The dataset has 20080 rows and 4 columns
* The dataset has 1 label - 'Rating' and 3 features

In [74]:
#check the names of columns in dataset
df.columns

Index(['Rating', 'Heading', 'Review', 'Product'], dtype='object')

In [75]:
#check the datatype of each feature
df.dtypes

Rating     float64
Heading     object
Review      object
Product     object
dtype: object

#### Observations:
   * `Rating` is of "float64" data type; `Heading`, `Review` and `Product` are of "object" data type.

In [76]:
#checking if there are any null values in the dataset
df.isna().sum()

Rating     1291
Heading    1279
Review     1279
Product       0
dtype: int64

In [77]:
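#indexing with the boolean frame keeps the original shape and masks every non-null cell,
#so the output is all NaN rather than a list of the rows that contain nulls
df[df.isna()]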

Unnamed: 0,Rating,Heading,Review,Product
0,,,,
1,,,,
2,,,,
3,,,,
4,,,,
5,,,,
6,,,,
7,,,,
8,,,,
9,,,,
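To list only the rows that contain at least one missing value, a row-wise mask is the usual approach. This is a minimal sketch on a toy frame (the values are invented for illustration); `isna().any(axis=1)` yields one boolean per row.

```python
import numpy as np
import pandas as pd

# Toy frame with the same column names as the dataset (values are illustrative).
toy = pd.DataFrame({
    "Rating": [5.0, np.nan, 4.0],
    "Review": ["Good phone", None, None],
    "Product": ["Smartphone"] * 3,
})

# One boolean per row: True if any column in that row is missing.
rows_with_na = toy[toy.isna().any(axis=1)]
print(rows_with_na)
```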


In [78]:
#dropping all rows that contain null values
df.dropna(inplace=True)

In [79]:
#cross checking null values
df.isna().sum()

Rating     0
Heading    0
Review     0
Product    0
dtype: int64

In [80]:
df.shape

(18789, 4)

#### Observations:
* There are 18789 rows in the dataset

In [81]:
df['Rating'].value_counts()

5.0    10414
4.0     4458
3.0     1763
1.0     1488
2.0      666
Name: Rating, dtype: int64

#### Observations:
* The labels are imbalanced: 5-star reviews make up more than half of the data.
* A review classifier tends to perform better when each rating has an equal or near-equal number of reviews.

#### Action:
* Ratings with more reviews than rating 1 (the lowest rating) will be capped at the number of reviews rating 1 has.
* The excess reviews will be dropped.
* Before dropping the excess, we will check the length of each review so that the longest reviews are dropped first, since they use up more space.
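The plan above can be sketched compactly with `groupby(...).head(n)`. This is only an illustration on invented data: the column `n_chars` and the variable `cap` are hypothetical names, and the notebook itself slices one frame per rating by hand rather than using `groupby`.

```python
import pandas as pd

# Toy data: rating 5 is over-represented (values are illustrative).
toy = pd.DataFrame({
    "Rating": [5, 5, 5, 5, 4, 4, 4, 1, 1, 2],
    "Review": ["a" * n for n in [9, 3, 7, 5, 4, 8, 2, 6, 1, 3]],
})
toy["n_chars"] = toy["Review"].str.len()

# Cap every rating at the number of rating-1 reviews.
cap = int((toy["Rating"] == 1).sum())

# Sort by length within each rating, then keep the `cap` shortest per rating.
# Ratings with fewer than `cap` reviews are kept in full.
balanced = (
    toy.sort_values(["Rating", "n_chars"])
       .groupby("Rating")
       .head(cap)
)
print(balanced["Rating"].value_counts().sort_index())
```

Because `head` keeps the first rows of each group, sorting by length first means the longest reviews are the ones discarded.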

In [82]:
#check info of the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18789 entries, 0 to 20079
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Rating   18789 non-null  float64
 1   Heading  18789 non-null  object 
 2   Review   18789 non-null  object 
 3   Product  18789 non-null  object 
dtypes: float64(1), object(3)
memory usage: 733.9+ KB


#### Observations
   * The info() method returns the data type of each column along with the non-null counts and memory usage.
   * Of the 4 columns, 3 are "object" type and the remaining one (`Rating`) is of float datatype.

In [83]:
#check number of unique values in each column
df.nunique()

Rating         5
Heading      170
Review     11324
Product        1
dtype: int64

#### Observations:
* The label rating has 5 unique values: 1, 2, 3, 4, 5
* Headings can repeat, since a heading is a short summary of the review.
* The reviews should be unique. Hence, we will drop the duplicate reviews to avoid over-fitting.

In [84]:
df.drop_duplicates(subset='Review', inplace=True)

In [85]:
#cross checking that no duplicate reviews remain
print(df.shape)
print(df.nunique())

(11324, 4)
Rating         5
Heading      168
Review     11324
Product        1
dtype: int64
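It can help to count duplicates before dropping them. The sketch below uses invented data to show `duplicated` and the default `keep='first'` behaviour of `drop_duplicates`. Note that deduplicating on `Review` alone also removes a repeated text that carries a different rating, which may or may not be desired.

```python
import pandas as pd

# Toy data: the text "Good" appears twice with different ratings (illustrative).
toy = pd.DataFrame({
    "Rating": [5.0, 4.0, 1.0],
    "Review": ["Good", "Good", "Bad"],
})

# Number of rows whose Review text already appeared earlier in the frame.
n_dupes = int(toy.duplicated(subset="Review").sum())
print(n_dupes)

# keep='first' (the default) retains the first occurrence of each text.
deduped = toy.drop_duplicates(subset="Review")
print(deduped)
```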


In [86]:
#length of each review in characters (despite the column name, this is not a word count)
df['Review_word_counter']=df['Review'].str.strip().str.len()
df.head()

Unnamed: 0,Rating,Heading,Review,Product,Review_word_counter
0,5.0,Simply awesome,Phone is awesome no problem full paisa wasool ...,Smartphone,500
1,5.0,Perfect product!,Same As Expected From Moto! This is the Best v...,Smartphone,219
2,5.0,Awesome,Good value for money phone. Best past is this ...,Smartphone,133
12,4.0,Worth the money,This mobile low budget in a good phone .\nGood...,Smartphone,133
13,4.0,Value-for-money,Rear Camera will be better with updates .but t...,Smartphone,398
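The column above stores a character count, because `str.len()` measures string length. For comparison, a small sketch of both measures on two made-up strings (the names `n_chars` and `n_words` are illustrative):

```python
import pandas as pd

reviews = pd.Series(["Good value for money", "  Nice Phone .  "])

# Characters after trimming surrounding whitespace (what the notebook computes).
n_chars = reviews.str.strip().str.len()

# Whitespace-separated tokens, i.e. an approximate word count.
n_words = reviews.str.split().str.len()

print(n_chars.tolist())
print(n_words.tolist())
```

Either measure works for ordering reviews by length; the two simply rank very short and very long reviews slightly differently.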


Most of the reviews have a 5-star rating. We sort the data by rating and then by review length, both in ascending order, so that when we drop the excess 4- and 5-star reviews the longest ones are removed first and we get a near-balanced dataset.

In [87]:
sorted_df=df.sort_values(by=['Rating','Review_word_counter'])
sorted_df.head()

Unnamed: 0,Rating,Heading,Review,Product,Review_word_counter
5890,1.0,Unsatisfactory,Oky,Smartphone,3
5934,1.0,Worthless,Bed,Smartphone,3
10307,1.0,Did not meet expectations,Fff,Smartphone,3
805,1.0,Worst experience ever!,Poor,Smartphone,4
7849,1.0,Did not meet expectations,. Mi,Smartphone,4


In [88]:
sorted_df.head(100)

Unnamed: 0,Rating,Heading,Review,Product,Review_word_counter
5890,1.0,Unsatisfactory,Oky,Smartphone,3
5934,1.0,Worthless,Bed,Smartphone,3
10307,1.0,Did not meet expectations,Fff,Smartphone,3
805,1.0,Worst experience ever!,Poor,Smartphone,4
7849,1.0,Did not meet expectations,. Mi,Smartphone,4
2513,1.0,Terrible product,Worst,Smartphone,5
5962,1.0,Useless product,Wrost,Smartphone,5
12483,1.0,Unsatisfactory,Waste,Smartphone,5
17863,1.0,Useless product,Bogas,Smartphone,5
19125,1.0,Waste of money!,Sorry,Smartphone,5


In [51]:
sorted_df['Rating'].value_counts()

5.0    4860
4.0    2272
1.0    1294
3.0    1132
2.0     535
Name: Rating, dtype: int64

Rating '2' has the fewest reviews. We will cap the number of reviews for ratings '4' and '5' at the number of reviews for rating '1'.

In [52]:
df_1=sorted_df[sorted_df['Rating']==1]
df_2=sorted_df[sorted_df['Rating']==2]
df_3=sorted_df[sorted_df['Rating']==3]
df_4=sorted_df[sorted_df['Rating']==4]
df_5=sorted_df[sorted_df['Rating']==5]

In [55]:
print('Number of Reviews for Rating 1: ', df_1.shape[0])
print('Number of Reviews for Rating 2: ', df_2.shape[0])
print('Number of Reviews for Rating 3: ', df_3.shape[0])
print('Number of Reviews for Rating 4: ', df_4.shape[0])
print('Number of Reviews for Rating 5: ', df_5.shape[0])

Number of Reviews for Rating 1:  1294
Number of Reviews for Rating 2:  535
Number of Reviews for Rating 3:  1132
Number of Reviews for Rating 4:  2272
Number of Reviews for Rating 5:  4860


In [58]:
#keep the 1294 shortest reviews for ratings 4 and 5 (the frames are already sorted by length)
df_4=df_4[:1294]
df_5=df_5[:1294]

(1295, 5)

In [61]:
df_4.shape

(2272, 5)

In [62]:
df_smartphone=pd.concat([df_1,df_2,df_3,df_4[:1294],df_5[:1294]])
df_smartphone.head()

Unnamed: 0,Rating,Heading,Review,Product,Review_word_counter
5890,1.0,Unsatisfactory,Oky,Smartphone,3
5934,1.0,Worthless,Bed,Smartphone,3
10307,1.0,Did not meet expectations,Fff,Smartphone,3
805,1.0,Worst experience ever!,Poor,Smartphone,4
7849,1.0,Did not meet expectations,. Mi,Smartphone,4


In [63]:
df_smartphone.shape

(5549, 5)

In [66]:
#Reshuffling and reindexing the data
from sklearn.utils import shuffle
df_smartphone=shuffle(df_smartphone)
df_smartphone.reset_index(inplace=True,drop=True)
df_smartphone.head(100)

Unnamed: 0,Rating,Heading,Review,Product,Review_word_counter
0,5.0,Mind-blowing purchase,Excellent ...,Smartphone,13
1,1.0,Hated it!,It take lot of time for charging as compare to...,Smartphone,141
2,1.0,Very poor,Waste Flipkart,Smartphone,14
3,5.0,Worth every penny,Osmmm,Smartphone,5
4,3.0,Fair,Its camera is bad i really really hated it. It...,Smartphone,216
5,4.0,Worth the money,Sound system is less,Smartphone,20
6,5.0,Highly recommended,Nice Phone .,Smartphone,12
7,1.0,Useless product,Video display not good and speaker music also ...,Smartphone,149
8,3.0,Fair,"Hanging problem, and price is high",Smartphone,34
9,1.0,Waste of money!,FullBattery charging 7 hours very slow,Smartphone,38


In [67]:
df_smartphone.drop(['Review_word_counter'],axis=1,inplace=True)
df_smartphone.head()

Unnamed: 0,Rating,Heading,Review,Product
0,5.0,Mind-blowing purchase,Excellent ...,Smartphone
1,1.0,Hated it!,It take lot of time for charging as compare to...,Smartphone
2,1.0,Very poor,Waste Flipkart,Smartphone
3,5.0,Worth every penny,Osmmm,Smartphone
4,3.0,Fair,Its camera is bad i really really hated it. It...,Smartphone


In [68]:
#index=False avoids writing the row index as an extra 'Unnamed: 0' column
df_smartphone.to_csv('Smartphone Rating_cleaned.csv',index=False)

In [69]:
df_smartphone.shape

(5549, 4)