# Data Cleaning-DSLR

The reviews of different Laptops, Smart Phones, Headphones, Smart Watches, DSLR (Professional Cameras), Printers, Monitors, Home Theaters, Routers from Flipkart.
The data preprocessing will be done on each of the dataset seaprately. First we will filter out the data to get equal number of reviews for each rating. The heading of the review is also extracted as the heading also can help in determining the rating.

In [29]:
#import the dataset
import pandas as pd
import numpy as np
pd.set_option('Display.max_columns',None)
pd.set_option('Display.max_rows',None)

df=pd.read_csv("DSLR Rating.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,Rating,Heading,Review,Product
0,0,5.0,Brilliant,It's a nice budget entry level mirrorless came...,DSLR
1,1,4.0,Good choice for the money,"For a beginner, this Camera seems to be the be...",DSLR
2,2,5.0,Wonderful,it's a awesome camera loved it.\nreally underr...,DSLR
3,3,,,,DSLR
4,4,,,,DSLR


#### Observations:
* The feature unnamed is index. Hence we can drop this feature.

In [30]:
df.drop(['Unnamed: 0'], axis=1, inplace=True)
df.head()

Unnamed: 0,Rating,Heading,Review,Product
0,5.0,Brilliant,It's a nice budget entry level mirrorless came...,DSLR
1,4.0,Good choice for the money,"For a beginner, this Camera seems to be the be...",DSLR
2,5.0,Wonderful,it's a awesome camera loved it.\nreally underr...,DSLR
3,,,,DSLR
4,,,,DSLR


### Exploratory Data Analysis

In [31]:
#check the dimensions of the data (Headphone)
df.shape

(5729, 4)

* The dataset has 5729 rows and 4 columns
* The dataset has 1 label - 'Rating' and 3 features

In [32]:
#check the names of columns in dataset
df.columns

Index(['Rating', 'Heading', 'Review', 'Product'], dtype='object')

In [33]:
#check the datatype of each feature
df.dtypes

Rating     float64
Heading     object
Review      object
Product     object
dtype: object

#### Observations:
   * All the feratures are of "object" data type.

In [34]:
#checking if there are any null values in the dataset
df.isna().sum()

Rating     118
Heading    118
Review     118
Product      0
dtype: int64

In [35]:
df[df.isna()]

Unnamed: 0,Rating,Heading,Review,Product
0,,,,
1,,,,
2,,,,
3,,,,
4,,,,
5,,,,
6,,,,
7,,,,
8,,,,
9,,,,


In [36]:
#dropping all numm values
df.dropna(inplace=True)

In [37]:
#cross checking null values
df.isna().sum()

Rating     0
Heading    0
Review     0
Product    0
dtype: int64

In [38]:
df.shape

(5611, 4)

#### Observations:
* There are 5611 rows in the dataset

In [39]:
df['Rating'].value_counts()

5.0    3934
4.0    1155
3.0     252
1.0     212
2.0      58
Name: Rating, dtype: int64

#### Observations:
* The labels are inbalanced. 
* The efficiency of review classifier will be better when we have equal or near equal number of reviews for each rating

#### Action:
* We will have as many number of reviews for each rating as there are in the 1 rating (least rating)
* The excess reviews dropped.
* Before dropping the excess reviews we will first check the length of reviews and drop long reviews as it will use up more space.

In [40]:
#check info of the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5611 entries, 0 to 5728
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Rating   5611 non-null   float64
 1   Heading  5611 non-null   object 
 2   Review   5611 non-null   object 
 3   Product  5611 non-null   object 
dtypes: float64(1), object(3)
memory usage: 219.2+ KB


#### Observations
   * The info() method thus returns the data type as well as the non-null values and memory usage.
   * Out of the total of 4 columns 3 columns are "object" type while rest of them are float datatype.

In [41]:
#check number of unique values in each class;
df.nunique()

Rating        5
Heading      95
Review     4097
Product       1
dtype: int64

#### Observations:
* The label rating has 5 unique values: 1, 2, 3, 4, 5
* The headings can be duplicate as it is kind of summary of the review.
* The reviews should be unique. Hence, we will drop the duplicate reviews to avoid over-fitting.

In [42]:
df.drop_duplicates(subset='Review', inplace=True)

In [43]:
#cross checking for diplicacy of reviews
print(df.shape)
print(df.nunique())

(4097, 4)
Rating        5
Heading      94
Review     4097
Product       1
dtype: int64


In [44]:
#checking the length of review
df['Review_word_counter']=df['Review'].str.strip().str.len()
df.head()

Unnamed: 0,Rating,Heading,Review,Product,Review_word_counter
0,5.0,Brilliant,It's a nice budget entry level mirrorless came...,DSLR,433
1,4.0,Good choice for the money,"For a beginner, this Camera seems to be the be...",DSLR,508
2,5.0,Wonderful,it's a awesome camera loved it.\nreally underr...,DSLR,346
13,5.0,Best in the market!,People who ever plan to progress from novice t...,DSLR,183
14,4.0,Good choice,Good Product. Great build. Decent battery life...,DSLR,82


The maximum number of reviews are for 5 star rating. We will sort the rating in ascending order and length of review. So, that we drop excess number of reviews for 5 star rating and get near balanced dataset.

In [45]:
sorted_df=df.sort_values(by=['Rating','Review_word_counter'])
sorted_df.head()

Unnamed: 0,Rating,Heading,Review,Product,Review_word_counter
1000,1.0,Terrible product,Bad,DSLR,3
2360,1.0,Useless product,bad,DSLR,3
3393,1.0,Don't waste your money,Miss,DSLR,4
743,1.0,Useless product,Worst,DSLR,5
2640,1.0,Very poor,Waste,DSLR,5


In [46]:
sorted_df.head(100)

Unnamed: 0,Rating,Heading,Review,Product,Review_word_counter
1000,1.0,Terrible product,Bad,DSLR,3
2360,1.0,Useless product,bad,DSLR,3
3393,1.0,Don't waste your money,Miss,DSLR,4
743,1.0,Useless product,Worst,DSLR,5
2640,1.0,Very poor,Waste,DSLR,5
612,1.0,Not recommended at all,Very bad,DSLR,8
3026,1.0,Horrible,Not good,DSLR,8
3433,1.0,Worst experience ever!,Very bed,DSLR,8
3804,1.0,Unsatisfactory,not good,DSLR,8
3909,1.0,Unsatisfactory,lite wet,DSLR,8


In [47]:
sorted_df['Rating'].value_counts()

5.0    2840
4.0     806
3.0     204
1.0     194
2.0      53
Name: Rating, dtype: int64

The least number of reviews are for rating '2'. We will make the number of reviews for rating '4' and '5' equal.

In [48]:
df_DSLR=pd.concat([df_1,df_2,df_3,df_4,df_5[:806]])
df_DSLR.head()

Unnamed: 0,Rating,Heading,Review,Product,Review_word_counter
3393,1.0,Don't waste your money,Miss,DSLR,4
743,1.0,Useless product,Worst,DSLR,5
612,1.0,Not recommended at all,Very bad,DSLR,8
3433,1.0,Worst experience ever!,Very bed,DSLR,8
3909,1.0,Unsatisfactory,lite wet,DSLR,8


In [49]:
df_DSLR.shape

(1977, 5)

In [50]:
#Reshuffling and reindexing the data
from sklearn.utils import shuffle
df_DSLR=shuffle(df_DSLR)
df_DSLR.reset_index(inplace=True,drop=True)
df_DSLR.head(100)

Unnamed: 0,Rating,Heading,Review,Product,Review_word_counter
0,5.0,Must buy!,awesome product....,DSLR,19
1,4.0,Really Nice,very good for this price range... beginners wi...,DSLR,61
2,2.0,Moderate,not as expected picture quality,DSLR,31
3,1.0,Worthless,So satisfying with the product and delivery se...,DSLR,96
4,3.0,Just okay,The build quality is ok .....plastic but looks...,DSLR,437
5,5.0,Terrific purchase,Nicr,DSLR,4
6,1.0,Worst experience ever!,MEMORY CARD NOT RECEIVED,DSLR,24
7,4.0,Value-for-money,Very good Camera,DSLR,16
8,5.0,Worth every penny,best products,DSLR,13
9,3.0,Does the job,Jus ok,DSLR,6


In [51]:
df_DSLR.drop(['Review_word_counter'],axis=1,inplace=True)
df_DSLR.head()

Unnamed: 0,Rating,Heading,Review,Product
0,5.0,Must buy!,awesome product....,DSLR
1,4.0,Really Nice,very good for this price range... beginners wi...,DSLR
2,2.0,Moderate,not as expected picture quality,DSLR
3,1.0,Worthless,So satisfying with the product and delivery se...,DSLR
4,3.0,Just okay,The build quality is ok .....plastic but looks...,DSLR


In [52]:
df_DSLR.to_csv('DSLR Rating_cleaned.csv')

In [53]:
df_DSLR.shape

(1977, 4)