# Data Cleaning-Printer

The project uses Flipkart reviews of laptops, smartphones, headphones, smart watches, DSLR (professional) cameras, printers, monitors, home theaters, and routers.
Each dataset is preprocessed separately; this notebook covers printers. First we filter the data to get an equal or near-equal number of reviews for each rating. The heading of each review is kept as well, since it can also help in determining the rating.

In [18]:
#import the dataset
import pandas as pd
import numpy as np
pd.set_option('display.max_columns',None)
pd.set_option('display.max_rows',None)

df=pd.read_csv("Printer Rating.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,Rating,Heading,Review,Product
0,0,5.0,Perfect product!,"Very nice printer, is worth the purchase. Idea...",Printer
1,1,5.0,Excellent,"great !! easy to use ,, budget friendly,, good...",Printer
2,2,5.0,Brilliant,nice product\nI am a student I ordered this fo...,Printer
3,3,,,,Printer
4,4,,,,Printer


#### Observations:
* The column 'Unnamed: 0' is only a saved copy of the row index and carries no information. Hence we can drop it.

In [19]:
df.drop(['Unnamed: 0'], axis=1, inplace=True)
df.head()

Unnamed: 0,Rating,Heading,Review,Product
0,5.0,Perfect product!,"Very nice printer, is worth the purchase. Idea...",Printer
1,5.0,Excellent,"great !! easy to use ,, budget friendly,, good...",Printer
2,5.0,Brilliant,nice product\nI am a student I ordered this fo...,Printer
3,,,,Printer
4,,,,Printer
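
An 'Unnamed: 0' column typically appears when a DataFrame is written with its index and then read back. That this is how the printer file was produced is an assumption, not something the notebook confirms. The sketch below reproduces the effect on a toy frame (illustrative values, in-memory buffers instead of files) and shows two ways to avoid it.

```python
import io
import pandas as pd

# Toy frame standing in for the review data (values are illustrative only).
toy = pd.DataFrame({"Rating": [5.0, 1.0], "Review": ["good", "bad"]})

# to_csv writes the index as a first column with an empty header by default.
buffer = io.StringIO()
toy.to_csv(buffer)

# Reading the file back names that header-less column 'Unnamed: 0'.
buffer.seek(0)
reread = pd.read_csv(buffer)
print(list(reread.columns))

# Option 1: treat the first column as the index while reading.
buffer.seek(0)
as_index = pd.read_csv(buffer, index_col=0)

# Option 2: do not write the index in the first place.
no_index = io.StringIO()
toy.to_csv(no_index, index=False)
no_index.seek(0)
clean = pd.read_csv(no_index)
print(list(as_index.columns), list(clean.columns))
```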


### Exploratory Data Analysis

In [20]:
#check the dimensions of the data (Printer)
df.shape

(14939, 4)

* The dataset has 14939 rows and 4 columns
* The dataset has 1 label - 'Rating' and 3 features

In [21]:
#check the names of columns in dataset
df.columns

Index(['Rating', 'Heading', 'Review', 'Product'], dtype='object')

In [22]:
#check the datatype of each feature
df.dtypes

Rating     float64
Heading     object
Review      object
Product     object
dtype: object

#### Observations:
   * 'Rating' is of float64 data type; the other three columns ('Heading', 'Review', 'Product') are of "object" data type.

In [23]:
#checking if there are any null values in the dataset
df.isna().sum()

Rating     456
Heading    429
Review     427
Product      0
dtype: int64

In [24]:
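#boolean-frame indexing keeps the shape and masks every non-null cell, so the result is all NaN;
#df[df.isna().any(axis=1)] would list the rows that contain nulls instead
df[df.isna()]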

Unnamed: 0,Rating,Heading,Review,Product
0,,,,
1,,,,
2,,,,
3,,,,
4,,,,
5,,,,
6,,,,
7,,,,
8,,,,
9,,,,
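
The all-NaN table above is what boolean-frame indexing produces: it masks cells instead of selecting rows. A minimal sketch of the difference, using a small made-up frame (the values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with some missing cells (illustrative values only).
toy = pd.DataFrame({
    "Rating": [5.0, np.nan, 1.0],
    "Heading": ["Excellent", None, "Very poor"],
    "Review": ["great", None, None],
})

# Indexing with a boolean frame keeps the shape and blanks every cell
# where the mask is False, so only the (already null) cells survive.
masked = toy[toy.isna()]

# Row-wise selection: keep rows where at least one column is null.
rows_with_nulls = toy[toy.isna().any(axis=1)]
print(rows_with_nulls)
```

The second form is usually the more useful one for inspecting which records are incomplete before calling `dropna`.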


In [25]:
#dropping all rows that contain null values
df.dropna(inplace=True)

In [26]:
#cross checking null values
df.isna().sum()

Rating     0
Heading    0
Review     0
Product    0
dtype: int64

In [27]:
df.shape

(14481, 4)

#### Observations:
* There are 14481 rows in the dataset

In [28]:
df['Rating'].value_counts()

5.0    6436
1.0    3391
4.0    2544
3.0    1333
2.0     777
Name: Rating, dtype: int64

#### Observations:
* The labels are imbalanced.
* A review classifier generally performs better when each rating has an equal or near-equal number of reviews.

#### Action:
* We will cap the number of reviews kept for the over-represented ratings so that the classes are closer to balanced (the exact cap is chosen once duplicate reviews have been removed).
* The excess reviews are dropped.
* Before dropping the excess reviews we first check the length of each review, so that the longest reviews are the ones dropped, as they use up more space.
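
The capping idea in the action list can be sketched on a toy frame. This is one possible formulation using `groupby(...).head(...)`, not the approach the notebook itself takes further down; the `length` column and the `cap` value are hypothetical names chosen for illustration.

```python
import pandas as pd

# Toy data: rating 5 is over-represented (illustrative values only).
toy = pd.DataFrame({
    "Rating": [5, 5, 5, 5, 1, 1, 1, 2],
    "Review": ["a" * n for n in [40, 5, 12, 30, 8, 3, 20, 9]],
})
toy["length"] = toy["Review"].str.len()

cap = 2  # hypothetical per-rating limit

# Sort so the shortest reviews come first within each rating,
# then keep at most `cap` rows per rating.
balanced = (
    toy.sort_values(["Rating", "length"])
       .groupby("Rating")
       .head(cap)
)
print(balanced["Rating"].value_counts().sort_index())
```

Ratings with fewer than `cap` reviews are kept in full, so the result is near-balanced rather than exactly balanced.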

In [29]:
#check info of the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14481 entries, 0 to 14938
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Rating   14481 non-null  float64
 1   Heading  14481 non-null  object 
 2   Review   14481 non-null  object 
 3   Product  14481 non-null  object 
dtypes: float64(1), object(3)
memory usage: 565.7+ KB


#### Observations
   * The info() method reports the data type and non-null count of each column, along with the memory usage.
   * Out of the total of 4 columns, 3 are of "object" type while the remaining one ('Rating') is of float64 type.

In [30]:
#check number of unique values in each column
df.nunique()

Rating        5
Heading     718
Review     5580
Product       1
dtype: int64

#### Observations:
* The label 'Rating' has 5 unique values: 1, 2, 3, 4, 5
* Headings may repeat, since a heading is a short summary of the review.
* The reviews should be unique: only 5580 of the 14481 reviews are distinct. Hence, we will drop the duplicate reviews to avoid over-fitting on repeated text.
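
Before dropping, it can help to count how many rows repeat an earlier review. A minimal sketch on made-up rows (the data is an assumption; `duplicated` and `drop_duplicates` are standard pandas methods):

```python
import pandas as pd

# Toy frame: the first two rows share the same review text.
toy = pd.DataFrame({
    "Rating": [5.0, 5.0, 4.0, 1.0],
    "Heading": ["Good", "Good", "Good", "Bad"],
    "Review": ["nice printer", "nice printer", "works well", "stopped working"],
})

# Rows whose review text already appeared earlier in the frame.
n_duplicates = toy.duplicated(subset="Review").sum()

# keep='first' (the default) retains the first occurrence of each review.
deduped = toy.drop_duplicates(subset="Review")
print(n_duplicates, deduped.shape)
```

Only 'Review' is used as the key, so repeated headings remain in the result.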

In [31]:
df.drop_duplicates(subset='Review', inplace=True)

In [32]:
#cross checking for duplicate reviews
print(df.shape)
print(df.nunique())

(5580, 4)
Rating        5
Heading     697
Review     5580
Product       1
dtype: int64


In [33]:
#checking the length of each review (in characters, despite the column name)
df['Review_word_counter']=df['Review'].str.strip().str.len()
df.head()

Unnamed: 0,Rating,Heading,Review,Product,Review_word_counter
0,5.0,Perfect product!,"Very nice printer, is worth the purchase. Idea...",Printer,510
1,5.0,Excellent,"great !! easy to use ,, budget friendly,, good...",Printer,110
2,5.0,Brilliant,nice product\nI am a student I ordered this fo...,Printer,201
9,1.0,Not recommended at all,printer is very good but it's cartage is not g...,Printer,202
10,5.0,nice product,"its a good printer, but i found it little slow...",Printer,119
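
Note that `str.len()` counts characters, not words, so 'Review_word_counter' holds character lengths. The sketch below contrasts the two measures on three short strings; `char_count` and `word_count` are hypothetical names used only here.

```python
import pandas as pd

reviews = pd.Series([
    "Bad",
    "its nice",
    "Good product but cartridge gets empty very soon.",
])

# Characters per review, after trimming surrounding whitespace.
char_count = reviews.str.strip().str.len()

# Words per review: split on whitespace, then count the pieces.
word_count = reviews.str.split().str.len()

print(pd.DataFrame({"chars": char_count, "words": word_count}))
```

Either measure works for ordering reviews from short to long; they differ only in what the number means.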


Rating 5 has the most reviews. We sort the data by rating and then by review length, both in ascending order, so that slicing a rating group keeps its shortest reviews. This lets us drop the excess reviews and get a near-balanced dataset.

In [34]:
sorted_df=df.sort_values(by=['Rating','Review_word_counter'])
sorted_df.head()

Unnamed: 0,Rating,Heading,Review,Product,Review_word_counter
566,1.0,Waste of money!,Bad,Printer,3
4036,1.0,Worst experience ever!,pad,Printer,3
5298,1.0,Awesome,top,Printer,3
14259,1.0,Good,NIC,Printer,3
1363,1.0,Worthless,Poor,Printer,4


In [35]:
sorted_df.head(100)

Unnamed: 0,Rating,Heading,Review,Product,Review_word_counter
566,1.0,Waste of money!,Bad,Printer,3
4036,1.0,Worst experience ever!,pad,Printer,3
5298,1.0,Awesome,top,Printer,3
14259,1.0,Good,NIC,Printer,3
1363,1.0,Worthless,Poor,Printer,4
2686,1.0,Not recommended at all,wast,Printer,4
1167,1.0,Absolute rubbish!,Worst,Printer,5
1653,1.0,Worst experience ever!,Wrost,Printer,5
2461,1.0,Did not meet expectations,waste,Printer,5
3218,1.0,Very poor,Canon,Printer,5


In [36]:
sorted_df['Rating'].value_counts()

5.0    2301
1.0    1463
4.0     951
3.0     525
2.0     340
Name: Rating, dtype: int64

Rating '2' has the fewest reviews (340). We cap ratings '1' and '5' at 951, the number of reviews for rating '4', and keep ratings '2', '3' and '4' in full. The result is a near-balanced dataset rather than an exactly balanced one.

In [37]:
df_1=sorted_df[sorted_df['Rating']==1]
df_2=sorted_df[sorted_df['Rating']==2]
df_3=sorted_df[sorted_df['Rating']==3]
df_4=sorted_df[sorted_df['Rating']==4]
df_5=sorted_df[sorted_df['Rating']==5]

In [38]:
df_printer=pd.concat([df_1[:951],df_2,df_3,df_4,df_5[:951]])
df_printer.head()

Unnamed: 0,Rating,Heading,Review,Product,Review_word_counter
566,1.0,Waste of money!,Bad,Printer,3
4036,1.0,Worst experience ever!,pad,Printer,3
5298,1.0,Awesome,top,Printer,3
14259,1.0,Good,NIC,Printer,3
1363,1.0,Worthless,Poor,Printer,4


In [39]:
df_printer.shape

(3718, 5)
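
The row count can be checked by hand from the per-rating counts after de-duplication: each rating contributes the smaller of its count and the slice length of 951. A small arithmetic sketch, with the counts copied from the `value_counts()` output above:

```python
# Per-rating counts after removing duplicate reviews.
counts = {5.0: 2301, 1.0: 1463, 4.0: 951, 3.0: 525, 2.0: 340}
cap = 951  # slice length applied to ratings 1 and 5

# A rating keeps min(count, cap) rows.
kept = {rating: min(n, cap) for rating, n in counts.items()}
total = sum(kept.values())
print(kept)
print(total)  # 951 + 951 + 951 + 525 + 340 = 3718
```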

In [40]:
#Reshuffling and reindexing the data
from sklearn.utils import shuffle
df_printer=shuffle(df_printer)
df_printer.reset_index(inplace=True,drop=True)
df_printer.head(100)

Unnamed: 0,Rating,Heading,Review,Product,Review_word_counter
0,2.0,"BAKWAS, HP is best",just prints....but print quality is not at all...,Printer,301
1,1.0,Very poor,Very bad.... After refilling its not working.....,Printer,49
2,5.0,Simply awesome,invoice sand,Printer,12
3,4.0,Wonderful,"Awesome printer, for school, college and offic...",Printer,301
4,4.0,Pretty good,Functioning good but not he cartridge get empt...,Printer,73
5,3.0,not so good but overall useful if it is workin...,if it is working properly then it is best choi...,Printer,234
6,5.0,Nic product,Canon prefect,Printer,13
7,4.0,Delightful,Good product but cartridge gets empty very soon.,Printer,48
8,1.0,Worst experience ever!,I will Refare Not Buy This Pinter. This Pinter...,Printer,62
9,4.0,Value-for-money,its nice,Printer,8


In [41]:
df_printer.drop(['Review_word_counter'],axis=1,inplace=True)
df_printer.head()

Unnamed: 0,Rating,Heading,Review,Product
0,2.0,"BAKWAS, HP is best",just prints....but print quality is not at all...,Printer
1,1.0,Very poor,Very bad.... After refilling its not working.....,Printer
2,5.0,Simply awesome,invoice sand,Printer
3,4.0,Wonderful,"Awesome printer, for school, college and offic...",Printer
4,4.0,Pretty good,Functioning good but not he cartridge get empt...,Printer


In [42]:
#note: the index is also written as an unnamed first column; pass index=False to omit it
df_printer.to_csv('Printer Rating_cleaned.csv')

In [43]:
df_printer.shape

(3718, 4)