
Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Is your problem regression or classification?
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency >= 50% and < 70% ? If so, you can just use accuracy if you want. Outside that range, accuracy could be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
    - Regression: Will you use mean absolute error, root mean squared error, R^2, or other regression metrics?
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?
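For the regression branch of the checklist above, a log transform of a right-skewed target might look like the following sketch. The target values are made up for illustration and `skewness` is a hypothetical helper, not part of any library. With NumPy arrays, `np.log1p` and `np.expm1` play the same roles as the `math` functions used here.

```python
import math
import statistics


def skewness(values):
    """Population skewness: the mean cubed z-score (hypothetical helper)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


# Made-up, right-skewed target values (assumption, for illustration only)
target = [1, 2, 2, 3, 3, 4, 5, 8, 20, 120]

# log1p(x) = log(1 + x), which also handles zeros safely
log_target = [math.log1p(v) for v in target]

# expm1 inverts log1p, so predictions can be mapped back to original units
restored = [math.expm1(v) for v in log_target]

print(f"skew before: {skewness(target):.2f}")
print(f"skew after:  {skewness(log_target):.2f}")
```

A model would be trained on `log_target`, and its predictions passed through `math.expm1` before computing error metrics in the original units.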

If you haven't found a dataset yet, do that today. [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2) and choose your dataset.

Some students worry, ***what if my model isn't “good”?*** Then, [produce a detailed tribute to your wrongness. That is science!](https://twitter.com/nathanwpyle/status/1176860147223867393)

In [1]:
import pandas as pd


df = pd.read_csv('googleplaystore.csv',
                 index_col='Last Updated')



In [2]:
# Drop the malformed row with misplaced values; it is identifiable by its missing 'Content Rating'
df.dropna(subset=['Content Rating'], inplace=True)

In [3]:
df['Content Rating'].isnull().sum()

0

In [4]:
df.index=pd.to_datetime(df.index)

In [5]:
print(df.shape)
df.head()

(10840, 12)


Last Updated,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Current Ver,Android Ver
2018-01-07,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,1.0.0,4.0.3 and up
2018-01-15,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,2.0.0,4.0.3 and up
2018-08-01,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,1.2.4,4.0.3 and up
2018-06-08,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,Varies with device,4.2 and up
2018-06-20,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,1.1,4.4 and up


In [6]:
df.select_dtypes('object').nunique().sort_values()



Type                 2
Content Rating       6
Installs            21
Category            33
Android Ver         33
Price               92
Genres             119
Size               461
Current Ver       2831
Reviews           6001
App               9659
dtype: int64

In [7]:
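# Check missing values per column and the distinct values of 'Type'
df.isnull().sum()
df['Type'].unique()
# Drop rows where 'Type' is missing
df.dropna(subset=['Type'], inplace=True)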


In [8]:
df['Type'].unique()

array(['Free', 'Paid'], dtype=object)

In [9]:
df['Rating'].isnull().sum()

1473

# **Wrangle**

*    Remove rows with a missing value in the target column
*    Remove high-cardinality categorical columns
*    Will need `SimpleImputer` later on for the remaining NaN values
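As a rough sketch of the imputation step mentioned in the last bullet, scikit-learn's `SimpleImputer` with `strategy="most_frequent"` can fill missing categorical values. The `toy` rows below are made up for illustration and are not taken from the project data. In a real workflow the imputer would be fit on the training split only and then applied to the validation and test splits.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up rows with one missing categorical value (assumption, not project data)
toy = pd.DataFrame(
    {
        "Android Ver": ["4.0.3 and up", np.nan, "4.0.3 and up", "4.1 and up"],
        "Type": ["Free", "Paid", "Free", "Free"],
    },
    dtype=object,
)

# 'most_frequent' works for string columns; 'mean' and 'median' need numeric data
imputer = SimpleImputer(strategy="most_frequent")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled)
```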

In [10]:
def wrangle(X):
  # Work on a copy so the caller's DataFrame is not modified
  X = X.copy()

  # Drop rows with a missing `Rating`, since they have no target
  X.dropna(subset=['Rating'], inplace=True)

  # Create the binary target `Great_Rating`: 1 if `Rating` >= 4, else 0
  X['Great_Rating'] = (X['Rating'] >= 4).astype(int)

  # Drop `Rating` to avoid leakage
  X.drop(columns='Rating', inplace=True)

  # Remove high-cardinality categorical columns (more than 33 unique values)
  high_card_cols = [col for col in X.select_dtypes('object').columns
                    if X[col].nunique() > 33]

  X.drop(columns=high_card_cols, inplace=True)

  return X





In [11]:
df = wrangle(df)

In [12]:
df.sort_values(by='Last Updated', ascending=True)

Last Updated,Category,Installs,Type,Content Rating,Android Ver,Great_Rating
2010-05-21,FAMILY,"100,000+",Free,Everyone,1.5 and up,1
2011-01-30,GAME,"50,000+",Free,Everyone,1.6 and up,1
2011-03-16,TOOLS,"100,000+",Free,Everyone,1.6 and up,1
2011-04-11,GAME,"5,000,000+",Free,Everyone 10+,2.0 and up,0
2011-04-16,GAME,"50,000+",Free,Everyone,1.6 and up,1
...,...,...,...,...,...,...
2018-08-08,HEALTH_AND_FITNESS,"1,000+",Paid,Everyone,4.2 and up,1
2018-08-08,SHOPPING,"1,000,000+",Free,Everyone,Varies with device,1
2018-08-08,GAME,"50,000,000+",Free,Teen,4.1 and up,1
2018-08-08,FINANCE,"5,000+",Free,Everyone,5.0 and up,0


# **Split Data**

In [13]:
target='Great_Rating'
y = df[target]
X = df.drop(target, axis=1)

In [14]:
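# Time-based split on the `Last Updated` index:
# train = before 2016, validation = 2017, test = 2018.
# Rows last updated in 2016 match none of these masks, so they fall in no split.
training_set = df[df.index.year < 2016]
validation_set = df[(df.index.year < 2018) & (df.index.year > 2016)]
test_set = df[df.index.year == 2018]

# Apply the same year masks to the feature matrix and target vector
train_mask = X.index.year < 2016
X_train, y_train = X.loc[train_mask], y.loc[train_mask]

val_mask = (X.index.year > 2016) & (X.index.year < 2018)
X_val, y_val = X.loc[val_mask], y.loc[val_mask]

test_mask = X.index.year == 2018
X_test, y_test = X.loc[test_mask], y.loc[test_mask]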
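One way to sanity-check year-based masks like the ones above is to count how many rows land in each split and how many land in none. The sketch below reuses the same year boundaries on a made-up `toy` frame (the dates and labels are assumptions, not project data).

```python
import pandas as pd

# Made-up dates and labels, one or two per year (assumption, not project data)
idx = pd.to_datetime(
    ["2014-03-01", "2015-07-15", "2016-02-10", "2017-05-20", "2017-11-02", "2018-01-07"]
)
toy = pd.DataFrame({"Great_Rating": [1, 0, 1, 1, 0, 1]}, index=idx)

# Same year boundaries as the split cell
train_mask = toy.index.year < 2016
val_mask = (toy.index.year > 2016) & (toy.index.year < 2018)
test_mask = toy.index.year == 2018

# Rows that match none of the three masks
unassigned_mask = ~(train_mask | val_mask | test_mask)

split_sizes = {
    "train": int(train_mask.sum()),
    "val": int(val_mask.sum()),
    "test": int(test_mask.sum()),
    "unassigned": int(unassigned_mask.sum()),
}
print(split_sizes)
```

With these boundaries the single 2016 row in `toy` ends up unassigned, which is worth confirming as intentional before training.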

# **Establish Baseline**

In [15]:
print('Baseline Accuracy:', y_train.value_counts(normalize=True).max())

Baseline Accuracy: 0.6643159379407616
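The baseline above is the majority-class frequency in the training set. A minimal sketch of scoring that same majority-class guess on the validation set follows. The label Series are made up and stand in for `y_train` and `y_val`.

```python
import pandas as pd

# Made-up labels standing in for y_train and y_val (assumption)
y_train_toy = pd.Series([1, 1, 0, 1, 0, 1])
y_val_toy = pd.Series([1, 0, 0, 1])

# The baseline always predicts the most common training class
majority_class = y_train_toy.mode()[0]

# Accuracy of that constant prediction on each split
train_baseline = (y_train_toy == majority_class).mean()
val_baseline = (y_val_toy == majority_class).mean()

print("Train baseline accuracy:", train_baseline)
print("Validation baseline accuracy:", val_baseline)
```

Because a time-based split can shift the class balance between years, the two numbers may differ, and the validation figure is the one a model has to beat.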


# **Build Model**

In [16]:
df['Great_Rating'].value_counts()

1    7368
0    1998
Name: Great_Rating, dtype: int64
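The counts above can be turned into a majority-class frequency and checked against the assignment's rule of thumb (accuracy alone is reasonable when that frequency is at least 50% and below 70%). This is a small sketch using the two printed counts. The variable names are illustrative.

```python
# Class counts copied from the value_counts() output above
counts = {1: 7368, 0: 1998}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)
majority_freq = counts[majority_class] / total

# Rule of thumb from the assignment: accuracy alone is fine in [0.50, 0.70)
accuracy_alone_ok = 0.5 <= majority_freq < 0.7

print(f"Majority class: {majority_class}")
print(f"Majority frequency: {majority_freq:.3f}")
print(f"Accuracy alone is enough: {accuracy_alone_ok}")
```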

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Is your problem regression or classification?
    - **`Rating` is the original target.**
    - **However, I created a new binary classification target from it and named that column `Great_Rating`.**
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
        - **It is now a binary classification target, and the classes are imbalanced.**
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency >= 50% and < 70% ? If so, you can just use accuracy if you want. Outside that range, accuracy could be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
        - **The majority class frequency is over 70% in the full wrangled dataset (7,368 of 9,366 rows, about 79%), so accuracy could be misleading.**
        - **I will also look at precision, recall, and F1.**
        - **Then, create a classification report (on the validation set).**
        - **Create one for each model I build, so one for the Random Forest model and one for the Logistic Regression model.**
        - **Then, plot a confusion matrix and an ROC curve, which shows how the model performs across different thresholds:**
          `plot_roc_curve(model_lr, X_val, y_val);`
          `plt.plot([0, 1], [0, 1], color='grey', linestyle='--')`
    - Regression: Will you use mean absolute error, root mean squared error, R^2, or other regression metrics?
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
        - **I performed a time-based split on the data.**
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?
    - **I excluded the high-cardinality categorical features, created a new target feature, and removed the old target feature to prevent leakage.**
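The evaluation plan in the notes above (precision, recall, F1, a classification report, a confusion matrix, and ROC analysis) can be sketched with `sklearn.metrics`. The labels and scores below are made up so the arithmetic is easy to follow, and they are not results from this project. In practice `y_true` would be `y_val` and `y_score` would come from a fitted model's `predict_proba`.

```python
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Made-up validation labels and predicted probabilities (assumption, not project results)
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.8, 0.7, 0.6, 0.4, 0.65, 0.55, 0.3, 0.2]

# Turn probabilities into class predictions at a 0.5 threshold
y_pred = [int(s >= 0.5) for s in y_score]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_pred)

# ROC AUC uses the scores, so it summarizes every threshold, not only 0.5
auc = roc_auc_score(y_true, y_score)

print(classification_report(y_true, y_pred))
print(cm)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} auc={auc:.3f}")
```

Running the same calls once per model (for example, once for the Random Forest and once for the Logistic Regression) gives directly comparable reports.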