# <b>Project #3 - Classification</b>

## <b>Import used packages</b>

In [1]:
import numpy as np
import pandas as pd
import sklearn
import sklearn.preprocessing, sklearn.cluster, sklearn.metrics
import scipy.spatial
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import AgglomerativeClustering, DBSCAN
import re

## <b>Loading the data - in our case from a CSV file</b>

In [2]:
df_full = pd.read_csv('./googleplaystore.csv', sep=',') 
df_full

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10836,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up
10838,Parkinson Exercices FR,MEDICAL,,3,9.5M,"1,000+",Free,0,Everyone,Medical,"January 20, 2017",1.0,2.2 and up
10839,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device


## <b>Let's take a look at the basic dataset information</b>

### <b>Dataset columns?</b>
- App: Application Name
- Category: Category the app belongs to
- Rating: Overall user rating of the app
- Reviews: Number of user reviews for the app
- Size: Size of the app
- Installs: Number of user downloads/installs for the app
- Type: Paid or Free
- Price: Price of the app
- Content Rating: Age group the app is targeted at
- Genres: An app can belong to multiple genres
- Last Updated: Date when the app was last updated on Play Store
- Current Ver: Current version of the app available on Play Store
- Android Ver: Minimum required Android version


### <b>Dataset shape?</b>
- 13 cols, 10841 rows
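The shape and the column types can be read straight from the DataFrame. The snippet below is a minimal sketch on a made-up three-column frame (`toy` and its values are illustrative assumptions, not the real dataset); on the loaded data the same two calls would be made on `df_full`.

```python
import pandas as pd

# Hypothetical miniature frame; numbers stored as text, as in the raw CSV.
toy = pd.DataFrame({
    "App": ["Coloring book moana", "Sketch - Draw & Paint"],
    "Reviews": ["967", "215644"],
    "Price": ["0", "0"],
})

# shape is a (rows, columns) tuple
n_rows, n_cols = toy.shape
print(f"{n_cols} cols, {n_rows} rows")

# dtypes shows which columns are still text and need converting
print(toy.dtypes)
```

Columns that hold numbers as text show up with a non-numeric dtype, which is what motivates the conversions in the next step.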


### <b>It would be better to convert the columns to the correct data types</b>
1. Price has a dollar sign before the value; remove it and convert the column to a numeric value
2. Reviews should be an integer
3. Size should be numeric: strip the M and k units and convert kilobyte values to megabytes
4. Reduce Android Ver (the minimum required Android version) to an integer, keeping the major version
5. Reduce Current Ver (the current version of the app) to an integer, keeping the leading digit of the version string
6. Convert Installs to an integer by stripping the `+` sign and the thousands separators
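The conversions listed above can be tried out on a few made-up values first. This is a minimal sketch, not the notebook's own cell: the helper names `size_to_megabytes` and `major_version` and the `toy` frame are assumptions introduced for illustration.

```python
import re

import pandas as pd


def size_to_megabytes(value):
    """Return the size in MB as a float, or NaN when no size is given."""
    text = str(value).strip()
    if text.endswith("k"):
        return round(float(text[:-1]) / 1024, 1)  # kilobytes -> megabytes
    if text.endswith("M"):
        return float(text[:-1])
    return float("nan")  # e.g. 'Varies with device'


def major_version(value):
    """Return the first run of digits as an int, or 0 when there is none."""
    match = re.search(r"[0-9]+", str(value))
    return int(match.group()) if match else 0


# Hypothetical sample rows in the raw text format of the dataset.
toy = pd.DataFrame({
    "Size": ["19M", "512k", "Varies with device"],
    "Price": ["0", "$4.99", "$0.99"],
    "Installs": ["10,000+", "500,000+", "100+"],
    "Android Ver": ["4.0.3 and up", "Varies with device", "2.2 and up"],
})

toy["Size"] = toy["Size"].map(size_to_megabytes)
# regex=False treats '$' and '+' as literal characters
toy["Price"] = toy["Price"].str.replace("$", "", regex=False).astype(float)
toy["Installs"] = (
    toy["Installs"].str.replace("+", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(int)
)
toy["Android Ver"] = toy["Android Ver"].map(major_version)
print(toy)
```

Named helpers are easier to test in isolation than long lambdas, at the cost of a few extra lines.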

In [3]:
# one record is malformed (its cells are shifted into the wrong columns), so drop it;
# .copy() gives an independent frame and avoids SettingWithCopyWarning below
df_full = df_full.loc[df_full['App'] !=
                      'Life Made WI-Fi Touchscreen Photo Frame'].copy()

# -------- 1. --------
# regex=False: '$' must be a literal character, not the end-of-string anchor
df_full['Price'] = df_full['Price'].str.replace(
    '$', '', regex=False).astype(float)

# -------- 2. --------
df_full['Reviews'] = pd.to_numeric(df_full["Reviews"])

# -------- 3. --------
# values in k are converted to M; 'Varies with device' becomes NaN
df_full['Size'] = df_full['Size'].apply(lambda x: round(
    float(x.replace('k', '')) / 1024, 1) if 'k' in x else x.replace('M', ''))
df_full['Size'] = pd.to_numeric(df_full['Size'], errors='coerce')

# -------- 4. --------
# first run of digits = major version; values without digits
# (NaN, 'Varies with device') become 0
df_full['Android Ver'] = df_full['Android Ver'].apply(lambda x: re.search(
    '[0-9]+', str(x)).group() if re.search('[0-9]+', str(x)) else 0).astype(int)

# -------- 5. --------
# note: [0] keeps only the leading digit of the first run of digits;
# values without digits become 0
df_full['Current Ver'] = df_full['Current Ver'].apply(lambda x: re.search(
    '[0-9]+', str(x)).group()[0] if re.search('[0-9]+', str(x)) else 0).astype(int)

# -------- 6. --------
df_full['Installs'] = df_full['Installs'].apply(
    lambda x: str(x).replace('+', '').replace(',', '')).astype(int)



### <b>Are there any missing values?</b>
- Mostly not: only Type, Rating and Size have missing values, and Type is missing in a single record
- Rating is empty in 1474 records, roughly 14% (about 1/7) of the data
- The empty Size values are the records whose original value was 'Varies with device'
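To judge whether a gap is negligible, it helps to look at the share of missing values next to the raw count. The following is a small sketch on a made-up frame (`toy` and the `missing` summary table are illustrative assumptions); `isna().mean()` gives the fraction of missing values per column because `True` counts as 1.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps in Rating and Size.
toy = pd.DataFrame({
    "Rating": [4.1, np.nan, 4.7, 4.5],
    "Size": [19.0, 14.0, np.nan, np.nan],
    "Type": ["Free", "Free", "Paid", "Free"],
})

# One row per column: absolute count and fraction of missing values.
missing = pd.DataFrame({
    "count": toy.isna().sum(),
    "share": toy.isna().mean(),
}).sort_values("count")
print(missing)
```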

In [4]:
df_full.isna().sum().sort_values()

App                  0
Category             0
Reviews              0
Installs             0
Price                0
Content Rating       0
Genres               0
Last Updated         0
Current Ver          0
Android Ver          0
Type                 1
Rating            1474
Size              1695
dtype: int64

### <b>Let's fill the missing Rating and Size values with the column mean and drop the record with the missing Type</b>

In [5]:
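# assign the result back instead of calling fillna(inplace=True) on a column,
# which acts on a temporary object and may leave df_full unchanged
rating_mean_value = df_full['Rating'].mean()
df_full['Rating'] = df_full['Rating'].fillna(rating_mean_value)

size_mean_value = df_full['Size'].mean()
df_full['Size'] = df_full['Size'].fillna(size_mean_value)

# only one record has no Type, so it is simply dropped
df_full = df_full.dropna(subset=['Type'])

df_full.isna().sum().sort_values()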



App               0
Category          0
Rating            0
Reviews           0
Size              0
Installs          0
Type              0
Price             0
Content Rating    0
Genres            0
Last Updated      0
Current Ver       0
Android Ver       0
dtype: int64
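The same imputation can be written as a single `DataFrame.fillna` call with a dictionary of per-column fill values. This is a sketch on made-up data (`toy` is an assumption, not the real dataset) that mirrors the order used above: fill with the means first, then drop rows without a Type.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one missing Rating, one missing Size, one missing Type.
toy = pd.DataFrame({
    "Rating": [4.0, np.nan, 5.0],
    "Size": [10.0, 20.0, np.nan],
    "Type": ["Free", None, "Paid"],
})

# Means are computed on the non-missing values of each column.
fill_values = {
    "Rating": toy["Rating"].mean(),  # (4.0 + 5.0) / 2 = 4.5
    "Size": toy["Size"].mean(),      # (10.0 + 20.0) / 2 = 15.0
}
toy = toy.fillna(fill_values)

# Columns not named in fill_values (Type) are left alone, so drop those rows.
toy = toy.dropna(subset=["Type"])
print(toy)
```

Mean imputation keeps every row but pulls the filled values toward the centre of the distribution; whether that is acceptable depends on the model used later.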

## <b>Classification algorithms work with numeric values only, so let's use the get_dummies() function to convert our non-numeric columns to numeric ones</b>

In [6]:
df = pd.get_dummies(df_full)
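As an illustration of what `get_dummies()` produces, here is a small sketch on a made-up frame (`toy`, `encoded` and the choice of columns are assumptions for the example). Called on a whole frame, the function creates one indicator column per distinct value of every text column, so a column with many distinct values, such as App, would contribute a very large number of columns. The optional `columns=` argument restricts the encoding to the listed columns.

```python
import pandas as pd

# Hypothetical sample with one numeric and two categorical columns.
toy = pd.DataFrame({
    "Rating": [4.1, 3.9, 4.7],
    "Type": ["Free", "Paid", "Free"],
    "Content Rating": ["Everyone", "Teen", "Everyone"],
})

# Encode only the listed columns; dtype=int yields 0/1 instead of booleans.
encoded = pd.get_dummies(toy, columns=["Type", "Content Rating"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

Numeric columns such as Rating pass through unchanged, and each encoded column is replaced by `<column>_<value>` indicator columns.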