### <b>**Regression** </b>

### **Agenda**

#### <b> **Problem Statement:** </b>

The Google Play Store team is about to launch a feature that boosts the visibility of promising apps. The boost will show up in several ways, including higher priority in recommendation sections (“Similar apps”, “You might also like”, “New and updated games”) and in search results. The feature is meant to bring more attention to newer apps that have potential.

#### <b> **Analysis to be done:** </b>

The task is to identify the apps that are worth promoting. App ratings, which come from users, are a direct indicator of how good an app is, so the problem reduces to predicting which apps will have high ratings.

#### <b>**Dataset**</b>

Google Play Store data (**googleplaystore.csv**)

Link: https://www.dropbox.com/sh/i06ohrau3ucfgbm/AACeYXumL56543KnDNQFlj8ma?dl=0


#### <b> **Data Dictionary:**</b>

|Variables|Description|
|:-|:-|
|App| Application name|
|Category|Category to which the app belongs|
|Rating|Overall user rating of the app|
|Reviews|Number of user reviews for the app|
|Size|Size of the app|
|Installs|Number of user downloads/installs for the app|
|Type|Paid or Free|
|Price|Price of the app|
|Content Rating|Age group the app is targeted at - Children / Mature 21+ / Adult|
|Genres|An app can belong to multiple genres (apart from its main category)<br>For example, a musical family game will belong to Music, Game, Family genres|
|Last Updated|Date when the app was last updated on Play Store|
|Current Ver|Current version of the app available on Play Store|
|Android Ver|Minimum required Android version|

#### <b> **Solution:**</b>

#### <b> **Import Libraries**</b>

In [None]:
#Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt, seaborn as sns
%matplotlib inline

#### <b> **Import and Check Dataset**</b>

In [None]:
inp0 = pd.read_csv("./googleplaystore.csv")

In [None]:
# Check the first two rows
inp0.head(2)

#### <b> Observations </b>
The first two rows of the dataset are displayed, showing the columns listed in the data dictionary.

In [None]:
#Check number of columns and rows, and data types
inp0.info()

#### <b> **Check Data Types**</b>

In [None]:
#checking datatypes
inp0.dtypes

#### <b> **Finding and Treating Null Values**</b>

In [None]:
#Finding count of null values
inp0.isnull().sum(axis=0)

In [None]:
#Dropping every record that has a null value in any column
#This removes the records with null ratings, which matters because Rating is our target variable
inp0.dropna(how ='any', inplace = True)
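
If the intent is to drop only the records whose rating is missing, `dropna` accepts a `subset` argument. The sketch below uses a made-up three-row frame (not rows from the dataset) to contrast the two calls.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one null rating, one null in another column
toy = pd.DataFrame({
    "Rating": [4.0, np.nan, 3.5],
    "Current Ver": ["1.0", "2.0", None],
})

# how="any" drops a row if ANY column is null
dropped_any = toy.dropna(how="any")

# subset=["Rating"] drops a row only if Rating is null
dropped_rating_only = toy.dropna(subset=["Rating"])

print(len(dropped_any), len(dropped_rating_only))
```

Which variant is appropriate depends on whether the other columns with nulls are needed later.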

In [None]:
inp0.isnull().sum(axis=0)

#### <b> Handling the Variables </b>

**1. Clean the price column**

In [None]:
#Cleaning the price column
inp0.Price.value_counts()[:5]

#### <b> Observations </b>
Some values carry a dollar sign and some are plain 0.
* We need to handle this conditionally.
* Let's set the value to 0 where it is '0'; otherwise drop the leading dollar sign (take the string from the second character onwards) and convert it to a float.

In [None]:
#Modifying the column
inp0['Price'] = inp0.Price.map(lambda x: 0 if x=='0' else float(x[1:]))
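
As a quick check of the idea, here is a minimal sketch on made-up price strings (not values read from the dataset). It uses the vectorized string methods as an alternative to the lambda; `lstrip("$")` leaves strings without a dollar sign unchanged.

```python
import pandas as pd

# Hypothetical raw values imitating the Price column
raw_price = pd.Series(["0", "$4.99", "$0.99", "0"])

# Strip a leading "$" if present, then convert to float
clean_price = raw_price.str.lstrip("$").astype(float)

print(clean_price.tolist())  # [0.0, 4.99, 0.99, 0.0]
```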

**The other columns with numeric data are:<br>**
1. Reviews
2. Installs
3. Size

**2. Convert reviews to numeric**

In [None]:
#Converting reviews to numeric
inp0.Reviews = inp0.Reviews.astype("int32")

In [None]:
inp0.Reviews.describe()

**3. Handle the installs column**

In [None]:
#Handling the installs column
inp0.Installs.value_counts()

We'll need to remove the commas and the plus signs.

<b> Defining function for the same </b>

In [None]:
def clean_installs(val):
    return int(val.replace(",","").replace("+",""))

In [None]:
inp0.Installs = inp0.Installs.map(clean_installs)

In [None]:
inp0.Installs.describe()

**4. Handle the app size field**

In [None]:
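#Handling the app size field: convert every size to kilobytes
def change_size(size):
    if 'M' in size:
        #Value in megabytes (suffix 'M'): strip the suffix and scale to kilobytes
        return float(size[:-1])*1000
    elif 'k' == size[-1:]:
        #Value already in kilobytes (suffix 'k'): strip the suffix
        return float(size[:-1])
    else:
        #Any other value has no usable size and becomes missing
        return None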

In [None]:
inp0["Size"] = inp0["Size"].map(change_size)

In [None]:
inp0.Size.describe()

In [None]:
#Filling the missing sizes by carrying the previous row's value forward
inp0["Size"] = inp0["Size"].ffill()
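
To make the effect of the conversion and the forward fill concrete, here is a small sketch on made-up size strings. The helper `to_kb` and the sample values are illustrative assumptions; it returns NaN rather than None so the result is unambiguously a float column.

```python
import pandas as pd

def to_kb(size):
    """Convert a size string with an 'M' or 'k' suffix to kilobytes."""
    if size.endswith("M"):
        return float(size[:-1]) * 1000
    if size.endswith("k"):
        return float(size[:-1])
    return float("nan")  # no recognizable suffix

# Hypothetical values, including one without a numeric size
toy_size = pd.Series(["19M", "Varies with device", "201k", "8.5M"])

size_kb = toy_size.map(to_kb)
filled = size_kb.ffill()  # the missing entry takes the value of the row above it

print(filled.tolist())  # [19000.0, 19000.0, 201.0, 8500.0]
```

Note that a forward fill copies the size of an unrelated app that happens to sit in the previous row, so the filled values depend on row order.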

In [None]:
#Checking datatypes
inp0.dtypes

#### **Sanity checks**

1. Average rating should be between 1 and 5, as only these values are allowed on Play Store. Drop any rows that have a value outside this range.

In [None]:
#Checking the rating
inp0.Rating.describe()

#### <b> Observations </b>

Min is 1 and max is 5. None of the values have rating outside the range.
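
Had any rating fallen outside the range, the offending rows could be dropped along these lines. This is a sketch on a made-up frame; `Series.between` is inclusive on both ends by default.

```python
import pandas as pd

# Hypothetical frame with two out-of-range ratings
toy = pd.DataFrame({"App": ["a", "b", "c"], "Rating": [4.1, 19.0, 0.5]})

# Keep only rows whose rating lies in [1, 5]
valid = toy[toy["Rating"].between(1, 5)].copy()

print(valid["App"].tolist())  # ['a']
```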

2. Reviews should not be more than installs as only those who installed can review the app.

Check whether any apps have more reviews than installs by counting such rows.

In [None]:
#Checking and counting the rows
len(inp0[inp0.Reviews > inp0.Installs])

In [None]:
inp0[inp0.Reviews > inp0.Installs]

In [None]:
inp0 = inp0[inp0.Reviews <= inp0.Installs].copy()

In [None]:
inp0.shape

3. For free apps **(Type == “Free”)**, the price should not be **> 0**. Drop any such rows.

In [None]:
len(inp0[(inp0.Type == "Free") & (inp0.Price>0)])
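
The cell above only counts the inconsistent rows. If the count were non-zero, the rows could be dropped with the negated mask, as in this sketch on a made-up frame.

```python
import pandas as pd

# Hypothetical frame: the last row is a "Free" app with a non-zero price
toy = pd.DataFrame({
    "Type": ["Free", "Paid", "Free"],
    "Price": [0.0, 2.99, 1.99],
})

# Boolean mask of inconsistent rows
bad = (toy["Type"] == "Free") & (toy["Price"] > 0)

# Keep everything that is not flagged
cleaned = toy[~bad].copy()

print(bad.sum(), cleaned.shape)
```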

#### **EDA**

#### <b> Box Plot: Price</b>

In [None]:
#Are there any outliers? Think about the price of usual apps on the Play Store.
sns.boxplot(x=inp0.Price)
plt.show()

#### <b> Box Plot: Reviews</b>

In [None]:
#Are there any apps with very high number of reviews? Do the values seem right?
sns.boxplot(x=inp0.Reviews)
plt.show()

#### **Checking Distribution and Skewness:**

How are the ratings distributed? Is it more toward higher ratings?
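
One way to put a number on this question is the sample skewness from `Series.skew()`. A negative value means the long tail points toward low values, i.e. most of the mass sits at the high end. The ratings below are made up for illustration.

```python
import pandas as pd

# Hypothetical ratings: mostly high, with one low value forming a left tail
toy_rating = pd.Series([4.5, 4.4, 4.6, 4.3, 4.7, 2.0])

rating_skew = toy_rating.skew()

# Negative skew: the distribution leans toward higher ratings
print(rating_skew < 0)  # True
```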

##### **Distribution of Ratings**

In [None]:
#Plotting the distribution of ratings
inp0.Rating.plot.hist()
#Show plot
plt.show()

##### **Histogram: Size**

In [None]:
inp0['Size'].plot.hist()
#Show plot
plt.show()

#### <b> Observations </b>
The histograms plot Rating and Size on the x-axis against frequency on the y-axis. Use the Rating histogram to judge whether the ratings are concentrated toward the higher end.

In [None]:
#Pair plot
sns.pairplot(data=inp0)

##### **Outlier Treatment:**


##### **1. Price:**

From the box plot, some apps appear to have very high prices. A price of $200 for an application on the Play Store is very high and suspicious.
Check the records that have a very high price. Is 200 a high price?

In [None]:
#Checking the records
len(inp0[inp0.Price > 200])

In [None]:
inp0[inp0.Price > 200]

In [None]:
inp0 = inp0[inp0.Price <= 200].copy()

inp0.shape

##### **2. Reviews:**

Very few apps have very high number of reviews. These are all star apps that don’t help with the analysis and, in fact, will skew it. Drop records having more than 2 million reviews.

In [None]:
#Dropping the records with more than 2 million reviews
inp0 = inp0[inp0.Reviews <= 2000000].copy()
inp0.shape

##### **3. Installs:**

There seem to be some outliers in this field too. Apps having a very high number of installs should be dropped from the analysis.
Find out the different percentiles – 10, 25, 50, 70, 90, 95, 99.

Decide a threshold as the cutoff for outliers and drop records having values more than the threshold.
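
The procedure can be sketched on made-up install counts as follows. The cutoff value here is a hypothetical choice for the toy data, not a recommendation for the real dataset.

```python
import pandas as pd

# Hypothetical install counts with one extreme value
toy_installs = pd.Series([
    100, 1_000, 10_000, 10_000, 50_000,
    100_000, 1_000_000, 5_000_000, 10_000_000, 1_000_000_000,
])

# Inspect the requested percentiles to see where the values jump
percentiles = toy_installs.quantile([0.1, 0.25, 0.5, 0.7, 0.9, 0.95, 0.99])
print(percentiles)

# Assumed cutoff, chosen by reading the percentile table
threshold = 10_000_000
kept = toy_installs[toy_installs <= threshold]

print(len(kept))  # 9
```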




In [None]:
#Checking the percentiles of installs to pick an outlier threshold
inp0.Installs.quantile([0.1, 0.25, 0.5, 0.70, 0.9, 0.95, 0.99])

#### <b> Observations </b>

It looks like only about 1% of apps have more than 100M installs. These apps might be genuine, but they will skew our analysis, so we need to drop them.


In [None]:
#Counting the apps with more than 100M installs
len(inp0[inp0.Installs > 100000000])

In [None]:
#Dropping the apps with more than 100M installs
inp0 = inp0[inp0.Installs <= 100000000].copy()
inp0.shape

In [None]:
#Importing warnings
import warnings
warnings.filterwarnings("ignore")

#### **Bi-variate Analysis:**

Let’s look at how the available predictors relate to the variable of interest, i.e., our target variable rating. Make scatter plots (for numeric features) and box plots (for character features) to assess the relationships between rating and the other features.

##### **1. Make scatter plot/joint plot for Rating vs. Price**

In [None]:
#What pattern do you observe? Does rating increase with price?
sns.jointplot(x=inp0.Price, y=inp0.Rating)

##### **2. Make scatter plot/joint plot for Rating vs. Size**

In [None]:
#Are heavier apps rated better?
sns.jointplot(x = inp0.Size, y = inp0.Rating)

##### **3. Make scatter plot/joint plot for Rating vs. Reviews**

In [None]:
# Do more reviews always mean a better rating?
sns.jointplot(x = inp0.Reviews, y = inp0.Rating)

##### **4. Make boxplot for Rating vs. Content Rating**

In [None]:
#Is there any difference in the ratings? Are some types liked better?
plt.figure(figsize=[8,6])
sns.boxplot(x = inp0['Content Rating'], y = inp0.Rating)

##### **5. Make boxplot for Ratings vs. Category**

In [None]:
#Which category has the best ratings?
plt.figure(figsize=[18,6])
g = sns.boxplot(x = inp0.Category, y = inp0.Rating)
plt.xticks(rotation=90)

#### **Pre-processing the Dataset**

##### **1. Make a copy of the dataset**

In [None]:
#Making a copy
inp1 = inp0.copy()

##### **2. Apply log transformation (np.log1p) to Reviews and Installs**

Reviews and Installs have some values that are still relatively very high.
Before building a linear regression model, you need to reduce the skew.

In [None]:
#Checking the spread of Installs before reducing the skew
inp0.Installs.describe()

In [None]:
inp1.Installs = inp1.Installs.apply(np.log1p)

In [None]:
inp1.Reviews = inp1.Reviews.apply(np.log1p)
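
To see why `np.log1p` helps, the sketch below compares the sample skewness of a made-up, heavily right-skewed series before and after the transformation. `np.log1p(x)` computes log(1 + x), so a value of 0 maps to 0 instead of producing negative infinity.

```python
import numpy as np
import pandas as pd

# Hypothetical counts spanning several orders of magnitude
toy_counts = pd.Series([10, 100, 1_000, 10_000, 1_000_000])

skew_before = toy_counts.skew()

# Same transformation as in the cells above
toy_logged = toy_counts.apply(np.log1p)
skew_after = toy_logged.skew()

# The transformed values are much closer to symmetric
print(abs(skew_after) < abs(skew_before))  # True
```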

##### **3. Drop columns App, Last Updated, Current Ver, and Android Ver**

 These variables are not useful for our task.

In [None]:
inp1.dtypes

In [None]:
#Dropping the variables that are not useful for our task
inp1.drop(["App", "Last Updated", "Current Ver", "Android Ver"], axis=1, inplace=True)
inp1.shape

##### **4. Dummy Columns:**


Get dummy columns for Category, Genres, and Content Rating. This is needed because the models cannot work with categorical data directly; all inputs must be numeric. Dummy encoding is one way to convert character fields to numeric fields. Note that `pd.get_dummies` encodes every remaining text column, so Type is encoded as well. Name the resulting dataframe **inp2**.

In [None]:
inp2 = pd.get_dummies(inp1, drop_first=True)
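
To illustrate what `drop_first=True` does, here is a minimal sketch on a made-up two-column frame. For each encoded column, the first category (in sorted order) is dropped and serves as the baseline, which avoids perfectly collinear dummy columns in a linear model.

```python
import pandas as pd

# Hypothetical frame with one text column and one numeric column
toy = pd.DataFrame({
    "Type": ["Free", "Paid", "Free"],
    "Rating": [4.1, 3.9, 4.5],
})

# Numeric columns pass through; "Type" becomes a single indicator column
encoded = pd.get_dummies(toy, drop_first=True)

print(encoded.columns.tolist())  # ['Rating', 'Type_Paid']
```

Here "Free" is the dropped baseline, so `Type_Paid` alone distinguishes the two categories.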

In [None]:
inp2.columns

In [None]:
inp2.shape