# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources in the README.md file
- Happy learning!

In [1426]:
#Import your libraries

import numpy as np
import pandas as pd

# Introduction

In this lab, we will use two datasets. Both contain variables that describe apps from the Google Play Store. We will use our knowledge of feature extraction to process these datasets and prepare them for use in an ML algorithm.

# Challenge 1 - Loading and Extracting Features from the First Dataset

#### In this challenge, our goals are: 

* Explore the dataset.
* Identify the columns with missing values.
* Either replace the missing values in each column or drop the column.
* Convert each column to the appropriate type.

#### The first dataset contains different information describing the apps. 

Load the dataset into the variable `google_play` in the cell below. The dataset is in the file `googleplaystore.csv`

In [1427]:
# Your code here:

google_play = pd.read_csv('/Users/basakbuluttekin/IronHack_New/ironhack_labs/DAFT_0410/module_3/Lab_5_Feature-Extraction/data/googleplaystore.csv')

#### Examine all variables and their types in the following cell

In [1428]:
google_play.dtypes

App                object
Category           object
Rating            float64
Reviews            object
Size               object
Installs           object
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

#### Since this dataset only contains one numeric column, let's skip the `describe()` function and look at the first 5 rows using the `head()` function

In [1429]:
google_play


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10836,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up
10838,Parkinson Exercices FR,MEDICAL,,3,9.5M,"1,000+",Free,0,Everyone,Medical,"January 20, 2017",1.0,2.2 and up
10839,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device


#### We can see that there are a few columns that could be coerced to numeric.

Start with the `Reviews` column. We can find out which value is causing this column to be of object type by finding its non-numeric values. To do this, we recall the `to_numeric()` function. With this function, we are able to coerce all non-numeric data to null. We can then use the `isnull()` function to subset our dataframe using the True/False column that this function generates.

In the cell below, transform the Reviews column to numeric and assign this new column to the variable `Reviews_numeric`. Make sure to coerce the errors.
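
The coerce-then-mask idea can be sketched on a toy Series before applying it to the real column. The values below are made up for illustration, and `reviews`, `reviews_numeric`, and `offending` are hypothetical names, not part of the lab.

```python
import pandas as pd

# Toy stand-in for the Reviews column (values are made up for illustration).
reviews = pd.Series(['159', '967', '3.0M', '87510'])

# With errors='coerce', strings that cannot be parsed become NaN
# instead of raising an error.
reviews_numeric = pd.to_numeric(reviews, errors='coerce')

# The boolean mask selects exactly the entries that failed to convert.
offending = reviews[reviews_numeric.isnull()]
print(offending.tolist())
```

Looking at the original strings behind the mask, rather than at the coerced column, is what reveals why the conversion failed.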

In [1430]:
google_play['Reviews_numeric'] = pd.to_numeric(google_play['Reviews'], errors = 'coerce')

Next, create a column containing True/False values using the `isnull()` function. Assign this column to the `Reviews_isnull` variable.

In [1431]:
google_play['Reviews_isnull'] = google_play['Reviews_numeric'].isnull()


Finally, subset `google_play` with `Reviews_isnull`. This should give you all the rows that contain non-numeric characters.

Your output should look like:

![Reviews_bool.png](reviews-bool.png)

In [1432]:
google_play[google_play['Reviews_isnull']]


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver,Reviews_numeric,Reviews_isnull
10472,Life Made WI-Fi Touchscreen Photo Frame,1.9,19.0,3.0M,"1,000+",Free,0,Everyone,,"February 11, 2018",1.0.19,4.0 and up,,,True


#### We see that Google Play is using a shorthand for millions. 

Let's write a function to transform this data.

Steps:

1. Create a function that returns the correct numeric values of *Reviews*.
1. Define a test string with `M` in the last character.
1. Test your function with the test string. Make sure your function works correctly. If not, modify your functions and test again.

In [1433]:
# Your code here

def convert_string_to_numeric(s):
    """
    Convert a string value to numeric. If the last character of the string is `M`, obtain the 
    numeric part of the string, multiply it with 1,000,000, then return the result. Otherwise, 
    convert the string to numeric value and return the result.
    
    Args:
        s: The Reviews score in string format.

    Returns:
        The correct numeric value of the Reviews score.
    """
    
    if s.endswith('M'):
        multiplier = 1000000
        s = s[:-1]  # strip multiplier character
        return float(s) * multiplier
    else:
        return float(s)
   

test_string = '4.0M'

convert_string_to_numeric(test_string) == 4000000

True

The last step is to apply the function to the `Reviews` column in the following cell:

In [1434]:
google_play['Reviews'] = google_play['Reviews'].apply(convert_string_to_numeric)


Check the non-numeric `Reviews` row again. It should have been fixed now and you should see:

![Reviews_bool_fixed.png](reviews-bool-fixed.png)

In [1435]:
google_play[google_play['Reviews'] == 3000000]


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver,Reviews_numeric,Reviews_isnull
10472,Life Made WI-Fi Touchscreen Photo Frame,1.9,19.0,3000000.0,"1,000+",Free,0,Everyone,,"February 11, 2018",1.0.19,4.0 and up,,,True


Also check the variable types of `google_play`. The `Reviews` column should be a `float64` type now.

In [1436]:
google_play.dtypes


App                 object
Category            object
Rating             float64
Reviews            float64
Size                object
Installs            object
Type                object
Price               object
Content Rating      object
Genres              object
Last Updated        object
Current Ver         object
Android Ver         object
Reviews_numeric    float64
Reviews_isnull        bool
dtype: object

#### The next column we will look at is `Size`. We start by looking at all unique values in `Size`:

*Hint: use `unique()` ([documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html))*.

In [1437]:
google_play['Size'].unique()


array(['19M', '14M', '8.7M', '25M', '2.8M', '5.6M', '29M', '33M', '3.1M',
       '28M', '12M', '20M', '21M', '37M', '2.7M', '5.5M', '17M', '39M',
       '31M', '4.2M', '7.0M', '23M', '6.0M', '6.1M', '4.6M', '9.2M',
       '5.2M', '11M', '24M', 'Varies with device', '9.4M', '15M', '10M',
       '1.2M', '26M', '8.0M', '7.9M', '56M', '57M', '35M', '54M', '201k',
       '3.6M', '5.7M', '8.6M', '2.4M', '27M', '2.5M', '16M', '3.4M',
       '8.9M', '3.9M', '2.9M', '38M', '32M', '5.4M', '18M', '1.1M',
       '2.2M', '4.5M', '9.8M', '52M', '9.0M', '6.7M', '30M', '2.6M',
       '7.1M', '3.7M', '22M', '7.4M', '6.4M', '3.2M', '8.2M', '9.9M',
       '4.9M', '9.5M', '5.0M', '5.9M', '13M', '73M', '6.8M', '3.5M',
       '4.0M', '2.3M', '7.2M', '2.1M', '42M', '7.3M', '9.1M', '55M',
       '23k', '6.5M', '1.5M', '7.5M', '51M', '41M', '48M', '8.5M', '46M',
       '8.3M', '4.3M', '4.7M', '3.3M', '40M', '7.8M', '8.8M', '6.6M',
       '5.1M', '61M', '66M', '79k', '8.4M', '118k', '44M', '695k', '1.6M',
     

You should have seen lots of unique values of the app sizes.

#### While we can convert most of the `Size` values to numeric in the same way we converted the `Reviews` values, there is one value that is impossible to convert.

What is that badass value? Enter it in the next cell and calculate the proportion of its occurrence to the total number of records of `google_play`.

In [1438]:
# Your code here

def convert_string_to_numeric2(s):
    """
    Convert a size string to a numeric value. If the string ends with `M`, multiply its 
    numeric part by 1,000,000; if it ends with `k`, multiply its numeric part by 1,000. 
    Any other value (e.g. `Varies with device`) is returned unchanged.
    
    Args:
        s: The app size in string format.

    Returns:
        The size as a float, or the original string if it has no `M` or `k` suffix.
    """
    
    if s.endswith('M'):
        return float(s[:-1]) * 1000000  # strip the suffix, then scale
    elif s.endswith('k'):
        return float(s[:-1]) * 1000  # strip the suffix, then scale
    else:
        return s
   

test_string = '4.0M'

convert_string_to_numeric2(test_string) == 4000000

True
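
As an alternative to applying a Python function value by value, the same conversion can be sketched with vectorized string methods. This is only an illustration on made-up values: `sizes`, `parts`, `multipliers`, and `size_numeric` are hypothetical names, and it assumes every convertible value is a number followed by a single `k` or `M`.

```python
import pandas as pd

# Toy stand-in for the Size column (values chosen for illustration).
sizes = pd.Series(['19M', '8.7M', '201k', 'Varies with device'])

# Split each value into a numeric part and a one-letter suffix.
# Values that do not match the pattern yield NaN in both columns.
parts = sizes.str.extract(r'^(?P<number>[\d.]+)(?P<suffix>[kM])$')

# Map each suffix to its multiplier and scale the numeric part.
multipliers = {'k': 1_000, 'M': 1_000_000}
size_numeric = pd.to_numeric(parts['number']) * parts['suffix'].map(multipliers)
print(size_numeric)
```

Unlike the function above, this sketch turns unconvertible values into NaN instead of leaving the original string in place, so the resulting column has a single numeric type.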

In [1439]:
google_play['Size'] = google_play['Size'].apply(convert_string_to_numeric2)



In [1440]:
#badass_value detection:
google_play['Size'].unique().tolist()

[19000000.0,
 14000000.0,
 8700000.0,
 25000000.0,
 2800000.0,
 5600000.0,
 29000000.0,
 33000000.0,
 3100000.0,
 28000000.0,
 12000000.0,
 20000000.0,
 21000000.0,
 37000000.0,
 2700000.0,
 5500000.0,
 17000000.0,
 39000000.0,
 31000000.0,
 4200000.0,
 7000000.0,
 23000000.0,
 6000000.0,
 6100000.0,
 4600000.0,
 9200000.0,
 5200000.0,
 11000000.0,
 24000000.0,
 'Varies with device',
 9400000.0,
 15000000.0,
 10000000.0,
 1200000.0,
 26000000.0,
 8000000.0,
 7900000.0,
 56000000.0,
 57000000.0,
 35000000.0,
 54000000.0,
 201000.0,
 3600000.0,
 5700000.0,
 8600000.0,
 2400000.0,
 27000000.0,
 2500000.0,
 16000000.0,
 3400000.0,
 8900000.0,
 3900000.0,
 2900000.0,
 38000000.0,
 32000000.0,
 5400000.0,
 18000000.0,
 1100000.0,
 2200000.0,
 4500000.0,
 9800000.0,
 52000000.0,
 9000000.0,
 6700000.0,
 30000000.0,
 2600000.0,
 7100000.0,
 3700000.0,
 22000000.0,
 7400000.0,
 6400000.0,
 3200000.0,
 8199999.999999999,
 9900000.0,
 4900000.0,
 9500000.0,
 5000000.0,
 5900000.0,
 13000000.0,
 7

In [1441]:
google_play[google_play['Size'] == 'Varies with device']

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver,Reviews_numeric,Reviews_isnull
37,Floor Plan Creator,ART_AND_DESIGN,4.1,36639.0,Varies with device,"5,000,000+",Free,0,Everyone,Art & Design,"July 14, 2018",Varies with device,2.3.3 and up,36639.0,False
42,Textgram - write on photos,ART_AND_DESIGN,4.4,295221.0,Varies with device,"10,000,000+",Free,0,Everyone,Art & Design,"July 30, 2018",Varies with device,Varies with device,295221.0,False
52,Used Cars and Trucks for Sale,AUTO_AND_VEHICLES,4.6,17057.0,Varies with device,"1,000,000+",Free,0,Everyone,Auto & Vehicles,"July 30, 2018",Varies with device,Varies with device,17057.0,False
67,Ulysse Speedometer,AUTO_AND_VEHICLES,4.3,40211.0,Varies with device,"5,000,000+",Free,0,Everyone,Auto & Vehicles,"July 30, 2018",Varies with device,Varies with device,40211.0,False
68,REPUVE,AUTO_AND_VEHICLES,3.9,356.0,Varies with device,"100,000+",Free,0,Everyone,Auto & Vehicles,"May 25, 2018",Varies with device,Varies with device,356.0,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10713,My Earthquake Alerts - US & Worldwide Earthquakes,WEATHER,4.4,3471.0,Varies with device,"100,000+",Free,0,Everyone,Weather,"July 24, 2018",Varies with device,Varies with device,3471.0,False
10725,Posta App,MAPS_AND_NAVIGATION,3.6,8.0,Varies with device,"1,000+",Free,0,Everyone,Maps & Navigation,"September 27, 2017",Varies with device,4.4 and up,8.0,False
10765,Chat For Strangers - Video Chat,SOCIAL,3.4,622.0,Varies with device,"100,000+",Free,0,Mature 17+,Social,"May 23, 2018",Varies with device,Varies with device,622.0,False
10826,Frim: get new friends on local chat rooms,SOCIAL,4.0,88486.0,Varies with device,"5,000,000+",Free,0,Mature 17+,Social,"March 23, 2018",Varies with device,Varies with device,88486.0,False


In [1442]:
# badass is 'Varies with device'; it appears 1695 times, about 15.6% of the dataset.
len(google_play[google_play['Size'] == 'Varies with device'])/len(google_play)

0.15635089013928605

#### While this column may be useful for other types of analysis, we opt to drop it from our dataset. 

There are two reasons. First, the majority of the data are ordinal, but a sizeable proportion are missing because we cannot convert them to numerical values. Ordinal data are both numerical and categorical, and they usually can be ranked (e.g. 82k is smaller than 91M). In contrast, non-ordinal categorical data such as blood type and eye color cannot be ranked. The second reason is that, as a categorical column, it has too many unique values to produce meaningful insights. Therefore, in our case the simplest strategy is to drop the column.

Drop the column in the cell below (use `inplace=True`)

In [1443]:
google_play.drop(columns=['Size', 'Reviews_numeric', 'Reviews_isnull'], inplace=True)
google_play.head(2)


Unnamed: 0,App,Category,Rating,Reviews,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159.0,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967.0,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up


#### Now let's look at how many missing values are in each column. 

This will give us an idea of whether we should come up with a missing data strategy or give up on the column altogether. In the next cell, find the number of missing values in each column: 

*Hint: use the `isna()` and `sum()` functions.*

In [1444]:
google_play.isna().sum()


App                  0
Category             0
Rating            1474
Reviews              0
Installs             0
Type                 1
Price                0
Content Rating       1
Genres               0
Last Updated         0
Current Ver          8
Android Ver          3
dtype: int64

You should find the column with the most missing values is now `Rating`.

#### What is the proportion of the missing values in `Rating` to the total number of records?

Enter your answer in the cell below.

In [1445]:
google_play['Rating'].isna().mean()


0.13596531685268887

A sizeable proportion of the `Rating` column is missing. A few other columns also contain several missing values.
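
To compare all columns at once, the share of missing values per column can be sketched by taking the mean of the `isna()` mask. The frame below is made up for illustration; `df` and `missing_share` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy frame with some missing values (made up for illustration).
df = pd.DataFrame({
    'Rating': [4.1, np.nan, 4.7, np.nan],
    'Type': ['Free', 'Free', None, 'Paid'],
    'Reviews': [159, 967, 87510, 4],
})

# isna() gives a True/False frame; the mean of a boolean column
# is the fraction of True values, i.e. the share of missing entries.
missing_share = df.isna().mean()
print(missing_share)
```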

#### We opt to preserve these columns and remove the rows containing missing data.

In particular, we don't want to drop the `Rating` column because:

* It is one of the most important columns in our dataset. 

* Since the dataset is not a time series, the loss of these rows will not have a negative impact on our ability to analyze the data. It will, however, cause us to lose some meaningful observations. But the loss is limited compared to the gain we receive by preserving these columns.

In the cell below, remove all rows containing at least one missing value. Use the `dropna()` function ([documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html)). Assign the new dataframe to the variable `google_missing_removed`.

In [1446]:
google_missing_removed = google_play.dropna(axis=0)


In [1447]:
google_missing_removed.head(2)

Unnamed: 0,App,Category,Rating,Reviews,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159.0,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967.0,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up


From now on, we use the `google_missing_removed` variable instead of `google_play`.

#### Next, we look at the `Last Updated` column.

The `Last Updated` column seems to contain a date, though it is classified as an object type. Let's convert this column using the `pd.to_datetime` function ([documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html)).

In [1448]:
pd.to_datetime(google_missing_removed['Last Updated'])


0       2018-01-07
1       2018-01-15
2       2018-08-01
3       2018-06-08
4       2018-06-20
           ...    
10834   2017-06-18
10836   2017-07-25
10837   2018-07-06
10839   2015-01-19
10840   2018-07-25
Name: Last Updated, Length: 9360, dtype: datetime64[ns]
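
Note that `pd.to_datetime` returns a new Series rather than modifying the dataframe, so the column keeps its `object` type unless the result is assigned back. A minimal sketch on a toy frame (the name `apps` and its two rows are made up for illustration):

```python
import pandas as pd

# Toy frame using the same date format as the Last Updated column.
apps = pd.DataFrame({'Last Updated': ['January 7, 2018', 'August 1, 2018']})

# Calling pd.to_datetime on its own leaves the dataframe unchanged.
converted = pd.to_datetime(apps['Last Updated'])
dtype_before = apps['Last Updated'].dtype

# Assigning the result back is what changes the stored column.
apps['Last Updated'] = converted
dtype_after = apps['Last Updated'].dtype

print(dtype_before, '->', dtype_after)
```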

#### The last column we will transform is `Price`. 

We start by looking at the unique values of this column.

In [1449]:
google_missing_removed['Price'].unique()


array(['0', '$4.99', '$3.99', '$6.99', '$7.99', '$5.99', '$2.99', '$3.49',
       '$1.99', '$9.99', '$7.49', '$0.99', '$9.00', '$5.49', '$10.00',
       '$24.99', '$11.99', '$79.99', '$16.99', '$14.99', '$29.99',
       '$12.99', '$2.49', '$10.99', '$1.50', '$19.99', '$15.99', '$33.99',
       '$39.99', '$3.95', '$4.49', '$1.70', '$8.99', '$1.49', '$3.88',
       '$399.99', '$17.99', '$400.00', '$3.02', '$1.76', '$4.84', '$4.77',
       '$1.61', '$2.50', '$1.59', '$6.49', '$1.29', '$299.99', '$379.99',
       '$37.99', '$18.99', '$389.99', '$8.49', '$1.75', '$14.00', '$2.00',
       '$3.08', '$2.59', '$19.40', '$3.90', '$4.59', '$15.46', '$3.04',
       '$13.99', '$4.29', '$3.28', '$4.60', '$1.00', '$2.95', '$2.90',
       '$1.97', '$2.56', '$1.20'], dtype=object)

Since all prices are ordinal data without exceptions, we can transform this column by removing the dollar sign and converting to numeric. We can create a new column called `Price Numerical` and drop the original column.

We will achieve our goal in three steps. Follow the instructions of each step below.

#### First we remove the dollar sign. Do this in the next cell by applying the `str.replace` function to the column to replace `$` with an empty string (`''`).

In [1450]:
google_missing_removed['Price Numerical'] = google_missing_removed['Price'].str.replace(r'^\$', '', regex=True)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  google_missing_removed['Price Numerical'] = google_missing_removed['Price'].str.replace(r'^\$', '', regex=True)


#### Second step, coerce the `Price Numerical` column to numeric.

In [1451]:
google_missing_removed['Price Numerical']= pd.to_numeric(google_missing_removed['Price Numerical'], errors='coerce')


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  google_missing_removed['Price Numerical']= pd.to_numeric(google_missing_removed['Price Numerical'], errors='coerce')


#### Finally, drop the original `Price` column.

In [1452]:
google_missing_removed.drop(columns='Price', inplace=True)


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  google_missing_removed.drop(columns='Price', inplace=True)
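
The `SettingWithCopyWarning` messages in the three cells above most likely appear because `google_missing_removed` was derived from `google_play` via `dropna()`, and pandas cannot tell whether later assignments are meant to affect the original frame. One common way to avoid the warning is to take an explicit copy and to reassign instead of using `inplace=True`. The sketch below uses made-up toy data; `raw` and `cleaned` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy stand-in for google_play (values are made up for illustration).
raw = pd.DataFrame({
    'Price': ['0', '$4.99', '$1.50'],
    'Rating': [4.1, np.nan, 3.9],
})

# .copy() makes the result an independent dataframe, so later column
# assignments are unambiguous and never touch `raw`.
cleaned = raw.dropna(axis=0).copy()

# Remove the literal dollar sign, then convert to a numeric type.
cleaned['Price Numerical'] = pd.to_numeric(
    cleaned['Price'].str.replace('$', '', regex=False)
)

# Reassigning the result avoids modifying a possible view in place.
cleaned = cleaned.drop(columns='Price')
print(cleaned)
```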


Now check the variable types of `google_missing_removed`. Make sure:

* `Size` and `Price` columns have been removed.
* `Rating`, `Reviews`, and `Price Numerical` have the type of `float64`.
* `Last Updated` has the type of `datetime64`.

In [1453]:
google_missing_removed.dtypes

App                 object
Category            object
Rating             float64
Reviews            float64
Installs            object
Type                object
Content Rating      object
Genres              object
Last Updated        object
Current Ver         object
Android Ver         object
Price Numerical    float64
dtype: object

# Challenge 2 - Loading and Extracting Features from the Second Dataset

Load the second dataset into the variable `google_review`. The data is in the file `googleplaystore_user_reviews.csv`.

In [1454]:
# Your code here:

google_review = pd.read_csv('/Users/basakbuluttekin/IronHack_New/ironhack_labs/DAFT_0410/module_3/Lab_5_Feature-Extraction/data/googleplaystore_user_reviews.csv')

#### This dataset contains the top 100 reviews for each app. 

Let's examine this dataset using the `head` function

In [1455]:
google_review.head()


Unnamed: 0,App,Translated_Review,Sentiment,Sentiment_Polarity,Sentiment_Subjectivity
0,10 Best Foods for You,I like eat delicious food. That's I'm cooking ...,Positive,1.0,0.533333
1,10 Best Foods for You,This help eating healthy exercise regular basis,Positive,0.25,0.288462
2,10 Best Foods for You,,,,
3,10 Best Foods for You,Works great especially going grocery store,Positive,0.4,0.875
4,10 Best Foods for You,Best idea us,Positive,1.0,0.3


#### The main piece of information we would like to extract from this dataset is the proportion of positive reviews of each app. 

Columns like `Sentiment_Polarity` and `Sentiment_Subjectivity` are not of interest to us because we do not yet know how to use them. We do not care about `Translated_Review` because natural language processing is too complex for us at present (in fact, the `Sentiment`, `Sentiment_Polarity`, and `Sentiment_Subjectivity` columns were derived from `Translated_Review` by the data scientists). 

What we care about in this challenge is `Sentiment`. To be more precise, we care about **what is the proportion of *Positive* sentiment of each app**. This will require us to aggregate the `Sentiment` data by `App` in order to calculate the proportions.

Now that you are clear about what we are trying to achieve, follow the steps below, which walk you through to our goal.

#### Our first step will be to remove all rows with missing sentiment. 

In the next cell, drop all rows with missing data using the `dropna()` function and assign this new dataframe to `review_missing_removed`.

In [1456]:
review_missing_removed = google_review.dropna(axis=0)


In [1457]:
review_missing_removed

Unnamed: 0,App,Translated_Review,Sentiment,Sentiment_Polarity,Sentiment_Subjectivity
0,10 Best Foods for You,I like eat delicious food. That's I'm cooking ...,Positive,1.000000,0.533333
1,10 Best Foods for You,This help eating healthy exercise regular basis,Positive,0.250000,0.288462
3,10 Best Foods for You,Works great especially going grocery store,Positive,0.400000,0.875000
4,10 Best Foods for You,Best idea us,Positive,1.000000,0.300000
5,10 Best Foods for You,Best way,Positive,1.000000,0.300000
...,...,...,...,...,...
64222,Housing-Real Estate & Property,Most ads older many agents ..not much owner po...,Positive,0.173333,0.486667
64223,Housing-Real Estate & Property,"If photos posted portal load, fit purpose. I'm...",Positive,0.225000,0.447222
64226,Housing-Real Estate & Property,"Dumb app, I wanted post property rent give opt...",Negative,-0.287500,0.250000
64227,Housing-Real Estate & Property,I property business got link SMS happy perform...,Positive,0.800000,1.000000


#### Now, use the `value_counts()` function ([documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html)) to get a sense of how many apps are in this dataset and their review counts.

In [1458]:
review_missing_removed['App'].value_counts()
#review_missing_removed.groupby('App')['Translated_Review']


App
Bowmasters                        312
Helix Jump                        273
Angry Birds Classic               273
Calorie Counter - MyFitnessPal    254
Duolingo: Learn Languages Free    240
                                 ... 
Draw a Stickman: EPIC 2             1
HD Camera                           1
Draw In                             1
Draw A Stickman                     1
Best Fiends - Free Puzzle Game      1
Name: count, Length: 865, dtype: int64

#### Now the tough part comes. Let's plan how we will achieve our goal:

1. We will count the number of reviews that contain *Positive* in the `Sentiment` column.

1. We will create a new dataframe to contain the `App` name, the number of positive reviews, and the total number of reviews of each app.

1. We will then loop over the new dataframe to calculate the positive review proportion of each app.

#### Step 1: Count the number of positive reviews.

In the following cell, write a function that takes a column and returns the number of times *Positive* appears in the column. 

*Hint: One option is to use the `np.where()` function ([documentation](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.where.html)).*
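
Both counting approaches can be compared on a toy column. With a single condition argument, `np.where()` returns a tuple of index arrays, so the size of its first element is the number of matches; summing the boolean mask gives the same count. The names and values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy Sentiment column (values are made up for illustration).
sentiment = pd.Series(['Positive', 'Negative', 'Positive', 'Neutral'])

# np.where(condition) returns the positions where the condition holds.
count_where = np.where(sentiment == 'Positive')[0].size

# Summing the boolean mask counts the True values directly.
count_sum = int((sentiment == 'Positive').sum())

print(count_where, count_sum)
```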

In [1459]:
# Your code below

def positive_function(x):
    """
    Count how many times the string `Positive` appears in a column (exact string match).
    
    Args:
        x: data column
    
    Returns:
        The number of occurrences of `Positive` in the column data.
    """
    return np.where(x == "Positive")[0].size

#### Step 2: Create a new dataframe to contain the `App` name, the number of positive reviews, and the total number of reviews of each app

We will group `review_missing_removed` by the `App` column, then aggregate the grouped dataframe on the number of positive reviews and the total review counts of each app. The result will be assigned to a new variable `google_agg`. See the [documentation on `agg()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.DataFrameGroupBy.agg.html) for how to achieve this. Take a moment to read the documentation and search for examples, because it is fairly complex.

When you obtain `google_agg`, check its values to make sure it has an `App` column as its index as well as a `Positive` column and a `Total` column. Your output should look like:

![Positive Reviews Agg](positive-review-agg.png)

*Hint: Use `positive_function` you created earlier as part of the param passed to the `agg()` function in order to aggregate the number of positive reviews.*

#### Bonus:

As of Pandas v0.23.4, you may opt to supply a list or a dict to `agg()`. If you use the list param, you'll need to rename the columns so that their names are `Positive` and `Total`. Using the dict param will allow you to create the aggregated columns with the desired names without renaming them. However, you will probably encounter a warning indicating that supplying a dict to `agg()` is deprecated. It's up to you which way to use. Try both. Either is fine as long as it works.
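
On pandas 0.25 or later, named aggregation is one more option: keyword arguments to `agg()` become the output column names, so no renaming is needed. The sketch below assumes such a pandas version and uses a made-up toy frame; `reviews` and `agg` are hypothetical names, and the lambda stands in for a counting function like `positive_function`.

```python
import pandas as pd

# Toy reviews frame (values are made up for illustration).
reviews = pd.DataFrame({
    'App': ['A', 'A', 'A', 'B', 'B'],
    'Sentiment': ['Positive', 'Negative', 'Positive', 'Neutral', 'Positive'],
})

# Each keyword names an output column; its value is the aggregation to apply.
agg = reviews.groupby('App')['Sentiment'].agg(
    Positive=lambda s: (s == 'Positive').sum(),
    Total='count',
)
print(agg)
```

The result is indexed by `App` and has exactly the `Positive` and `Total` columns.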

In [1460]:
df1 = review_missing_removed.groupby('App')['Sentiment'].agg(lambda x: (positive_function(x), len(x)))
google_agg = pd.DataFrame(df1.values.tolist(), columns=['Positive', 'Total'], index = df1.index)
google_agg

Unnamed: 0_level_0,Positive,Total
App,Unnamed: 1_level_1,Unnamed: 2_level_1
10 Best Foods for You,162,194
104 找工作 - 找工作 找打工 找兼職 履歷健檢 履歷診療室,31,40
11st,23,39
1800 Contacts - Lens Store,64,80
1LINE – One Line with One Touch,27,38
...,...,...
Hotels.com: Book Hotel Rooms & Find Vacation Deals,39,68
Hotspot Shield Free VPN Proxy & Wi-Fi Security,17,34
Hotstar,14,32
Hotwire Hotel & Car Rental App,16,33


Print the first 5 rows of `google_agg` to check it.

In [1461]:
google_agg.head(5)


Unnamed: 0_level_0,Positive,Total
App,Unnamed: 1_level_1,Unnamed: 2_level_1
10 Best Foods for You,162,194
104 找工作 - 找工作 找打工 找兼職 履歷健檢 履歷診療室,31,40
11st,23,39
1800 Contacts - Lens Store,64,80
1LINE – One Line with One Touch,27,38


#### Add a derived column to `google_agg` that is the ratio of the `Positive` and the `Total` columns. Call this column `Positive Ratio`. 

Make sure to account for the case where the denominator is zero using the `np.where()` function.
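
A guarded ratio can be sketched on a toy frame that actually contains a zero denominator. The fallback value of `0.0` is an assumption made for this illustration (NaN would be an equally defensible choice), and `counts` and `safe_total` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy counts, including one app with no reviews (made up for illustration).
counts = pd.DataFrame(
    {'Positive': [3, 0, 5], 'Total': [4, 0, 10]},
    index=['A', 'B', 'C'],
)

# Swap zero totals for NaN so the division itself never divides by zero.
safe_total = counts['Total'].replace(0, np.nan)

# np.where picks the fallback where Total is zero and the ratio elsewhere.
counts['Positive Ratio'] = np.where(
    counts['Total'] == 0, 0.0, counts['Positive'] / safe_total
)
print(counts)
```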

In [1462]:
# Guard against division by zero: an app with no reviews gets a ratio of 0.
google_agg['Positive Ratio'] = np.where(
    google_agg['Total'] == 0, 0, google_agg['Positive'] / google_agg['Total']
)

#### Now drop the `Positive` and `Total` columns. Do this with `inplace=True`.

In [1463]:
google_agg.drop(columns=['Positive', 'Total'], inplace=True)

Print the first 5 rows of `google_agg`. Your output should look like:

![Positive Reviews Agg](positive-review-ratio.png)

In [1464]:
google_agg.head(5)


Unnamed: 0_level_0,Positive Ratio
App,Unnamed: 1_level_1
10 Best Foods for You,0.835052
104 找工作 - 找工作 找打工 找兼職 履歷健檢 履歷診療室,0.775
11st,0.589744
1800 Contacts - Lens Store,0.8
1LINE – One Line with One Touch,0.710526


# Challenge 3 - Join the Dataframes

In this part of the lab, we will join the two dataframes and obtain a dataframe that contains features we can use in our ML algorithm.

In the next cell, join the `google_missing_removed` dataframe with the `google_agg` dataframe on the `App` column. Assign this dataframe to the variable `google`.
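
How `join()` with `on='App'` behaves can be sketched on two toy frames. It matches the `App` column of the left frame against the index of the right frame and, by default, keeps every left row, filling NaN where no match exists. All names and values below are made up for illustration.

```python
import pandas as pd

# Left frame: one row per app, with App as an ordinary column.
apps = pd.DataFrame({'App': ['A', 'B', 'C'], 'Rating': [4.1, 3.9, 4.7]})

# Right frame: indexed by App, like an aggregated reviews table.
ratios = pd.DataFrame(
    {'Positive Ratio': [0.8, 0.6]},
    index=pd.Index(['A', 'C'], name='App'),
)

# Default how='left': app B has no reviews, so its ratio is NaN.
joined = apps.join(ratios, on='App')

# Keep only the apps that found a match, if NaN rows are not wanted.
matched = joined.dropna(subset=['Positive Ratio'])
print(joined)
```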

In [1465]:
google = google_missing_removed.join(google_agg, on='App')


#### Let's look at the final result using the `head()` function. Your final product should look like:

![Final Product](google-final-head.png)

In [1466]:
google.head(5)


Unnamed: 0,App,Category,Rating,Reviews,Installs,Type,Content Rating,Genres,Last Updated,Current Ver,Android Ver,Price Numerical,Positive Ratio
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159.0,"10,000+",Free,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up,0.0,
1,Coloring book moana,ART_AND_DESIGN,3.9,967.0,"500,000+",Free,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up,0.0,0.590909
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510.0,"5,000,000+",Free,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up,0.0,
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644.0,"50,000,000+",Free,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up,0.0,
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967.0,"100,000+",Free,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up,0.0,


# Challenge 4 - Feature Selection and Modelling

Let's work with data about bank marketing. You can find the dataset and its description in the data folder.
Please use RFE, RFECV and SelectFromModel to select the features from your dataset.
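
The three selectors share the same fit-then-inspect pattern, which can be sketched on synthetic data before turning to the bank dataset. This is only an illustration, assuming scikit-learn is installed: the data comes from `make_classification`, and the choice of `LogisticRegression` and of 3 features to keep are assumptions, not requirements of the lab.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, RFECV, SelectFromModel
from sklearn.linear_model import LogisticRegression

# Synthetic data: 8 features, of which 3 carry signal.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
estimator = LogisticRegression(max_iter=1000)

# RFE: repeatedly drop the weakest feature until 3 remain.
rfe = RFE(estimator, n_features_to_select=3).fit(X, y)

# RFECV: like RFE, but the number of features is chosen by cross-validation.
rfecv = RFECV(estimator, step=1, cv=5).fit(X, y)

# SelectFromModel: keep features whose importance exceeds a threshold.
sfm = SelectFromModel(estimator).fit(X, y)

print(rfe.support_)        # boolean mask of the features RFE kept
print(rfecv.n_features_)   # number of features RFECV settled on
print(sfm.get_support())   # boolean mask of the features SelectFromModel kept
```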

Step 1. Check your data. Clean and encode it if necessary.
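
One possible encoding approach is to map the yes/no target to 0/1 and one-hot encode the categorical features with `pd.get_dummies`. The toy frame below only borrows a few column names from the bank data; its rows are made up, and `bank`, `target`, and `features` are hypothetical names.

```python
import pandas as pd

# Toy frame with a few bank-style columns (rows are made up).
bank = pd.DataFrame({
    'age': [56, 37, 40],
    'marital': ['married', 'single', 'married'],
    'housing': ['no', 'yes', 'unknown'],
    'y': ['no', 'yes', 'no'],
})

# Binary target: map the two labels to integers.
target = bank['y'].map({'no': 0, 'yes': 1})

# One-hot encode the categorical columns; numeric columns pass through.
features = pd.get_dummies(bank.drop(columns='y'), columns=['marital', 'housing'])
print(features.columns.tolist())
```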

In [1467]:
bank_data = pd.read_excel('/Users/basakbuluttekin/IronHack_New/ironhack_labs/DAFT_0410/module_3/Lab_5_Feature-Extraction/data/bank_marketing.xlsx')

In [1468]:
bank_data

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,2022-01-01,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,2022-01-01,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,2022-01-01,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,2022-01-01,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,2022-01-01,93.994,-36.4,4.857,5191.0,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,73,retired,married,professional.course,no,yes,no,cellular,nov,fri,...,1,999,0,nonexistent,NaT,94.767,-50.8,1.028,,yes
41184,46,blue-collar,married,professional.course,no,no,no,cellular,nov,fri,...,1,999,0,nonexistent,NaT,94.767,-50.8,1.028,,no
41185,56,retired,married,university.degree,no,yes,no,cellular,nov,fri,...,2,999,0,nonexistent,NaT,94.767,-50.8,1.028,,no
41186,44,technician,married,professional.course,no,no,no,cellular,nov,fri,...,1,999,0,nonexistent,NaT,94.767,-50.8,1.028,,yes


In [1469]:
bank_data.dtypes

age                        int64
job                       object
marital                   object
education                 object
default                   object
housing                   object
loan                      object
contact                   object
month                     object
day_of_week               object
duration                   int64
campaign                   int64
pdays                      int64
previous                   int64
poutcome                  object
emp.var.rate      datetime64[ns]
cons.price.idx           float64
cons.conf.idx            float64
euribor3m                float64
nr.employed              float64
y                         object
dtype: object

In [1470]:
bank_data.isna().sum()

age                   0
job                   0
marital               0
education             0
default               0
housing               0
loan                  0
contact               0
month                 0
day_of_week           0
duration              0
campaign              0
pdays                 0
previous              0
poutcome              0
emp.var.rate      17191
cons.price.idx        0
cons.conf.idx         0
euribor3m             0
nr.employed       33425
y                     0
dtype: int64

In [1471]:
bank_data.head(3)

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,2022-01-01,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,2022-01-01,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,2022-01-01,93.994,-36.4,4.857,5191.0,no


In [1472]:
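# Drop the two columns with many missing values ('emp.var.rate' was also parsed as datetime instead of a number)
bank_data.drop(columns=['nr.employed', 'emp.var.rate'], inplace=True)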
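
The two dropped columns are the ones with missing values in the `isna().sum()` output (about 42% of rows for `emp.var.rate` and about 81% for `nr.employed`). A minimal sketch of making that choice from the missing-value fraction instead of by hand is shown below. The toy frame, its column names and the 0.4 cutoff are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are illustrative, not taken from the bank data
toy = pd.DataFrame({
    "age": [56, 57, 37, 40, 56],
    "rate": [1.1, np.nan, np.nan, np.nan, 1.4],
    "employed": [5191.0, 5191.0, np.nan, 5191.0, 5191.0],
})

# Fraction of missing values per column
missing_fraction = toy.isna().mean()

# Assumed cutoff: drop every column with more than 40% missing values
threshold = 0.4
to_drop = missing_fraction[missing_fraction > threshold].index.tolist()
toy_clean = toy.drop(columns=to_drop)
print(to_drop)
```

Columns just below the cutoff could be imputed instead of dropped; which option is better depends on the data.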

In [1473]:
bank_data['y'].value_counts()

y
no     36548
yes     4640
Name: count, dtype: int64

In [1474]:
bank_data['housing'].value_counts()

housing
yes        21576
no         18622
unknown      990
Name: count, dtype: int64

In [1475]:
bank_data['contact'].value_counts()

contact
cellular     26144
telephone    15044
Name: count, dtype: int64

In [1476]:
bank_data['loan'].value_counts()

loan
no         33950
yes         6248
unknown      990
Name: count, dtype: int64

In [1477]:
bank_data['job'].value_counts()

job
admin.           10422
blue-collar       9254
technician        6743
services          3969
management        2924
retired           1720
entrepreneur      1456
self-employed     1421
housemaid         1060
unemployed        1014
student            875
unknown            330
Name: count, dtype: int64

In [1478]:
columns_to_encode = ['job', 'loan', 'contact', 'housing', 'marital', 'education', 'default', 'poutcome', 'month', 'day_of_week' ]
bank_data_new = pd.get_dummies(bank_data, columns = columns_to_encode)

In [1479]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
bank_data_new['y'] = label_encoder.fit_transform(bank_data_new['y'])

In [1480]:
bank_data_new['y'].value_counts()

y
0    36548
1     4640
Name: count, dtype: int64
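
To make the two encoding steps above easier to follow, here is a small sketch on a made-up frame. The toy data is an assumption, and `drop_first=True` is an optional variation that the lab itself does not use; it removes one redundant dummy column per category.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame with one categorical feature and a yes/no target
toy = pd.DataFrame({
    "contact": ["telephone", "cellular", "cellular"],
    "y": ["no", "yes", "no"],
})

# One-hot encode the categorical feature; drop_first=True drops the first category
encoded = pd.get_dummies(toy, columns=["contact"], drop_first=True)

# LabelEncoder sorts the classes, so 'no' becomes 0 and 'yes' becomes 1
encoder = LabelEncoder()
encoded["y"] = encoder.fit_transform(encoded["y"])
print(encoded)
print(list(encoder.classes_))
```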

Step 2. List your features

In [1481]:
print(bank_data_new.columns)

Index(['age', 'duration', 'campaign', 'pdays', 'previous', 'cons.price.idx',
       'cons.conf.idx', 'euribor3m', 'y', 'job_admin.', 'job_blue-collar',
       'job_entrepreneur', 'job_housemaid', 'job_management', 'job_retired',
       'job_self-employed', 'job_services', 'job_student', 'job_technician',
       'job_unemployed', 'job_unknown', 'loan_no', 'loan_unknown', 'loan_yes',
       'contact_cellular', 'contact_telephone', 'housing_no',
       'housing_unknown', 'housing_yes', 'marital_divorced', 'marital_married',
       'marital_single', 'marital_unknown', 'education_basic.4y',
       'education_basic.6y', 'education_basic.9y', 'education_high.school',
       'education_illiterate', 'education_professional.course',
       'education_university.degree', 'education_unknown', 'default_no',
       'default_unknown', 'default_yes', 'poutcome_failure',
       'poutcome_nonexistent', 'poutcome_success', 'month_apr', 'month_aug',
       'month_dec', 'month_jul', 'month_jun', 'month_mar

In [1482]:
bank_data_new[['y']]

Unnamed: 0,y
0,0
1,0
2,0
3,0
4,0
...,...
41183,1
41184,0
41185,0
41186,1


In [1483]:
import sklearn
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    precision_recall_curve, roc_curve, roc_auc_score, f1_score, confusion_matrix
    
)
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from numpy import array

In [1484]:
def evaluate_model(note, Mparam, model, X_test, y_test, results):
    """Score a fitted classifier on the test set and append one row to `results`."""
    pred = model.predict(X_test)
    score = model.score(X_test, y_test)
    precision = precision_score(y_test, pred)
    recall = recall_score(y_test, pred)
    f1 = f1_score(y_test, pred)
    conf_matrix = confusion_matrix(y_test, pred)
    # scikit-learn puts true labels in rows and predicted labels in columns,
    # so for labels [0, 1] the flattened order is TN, FP, FN, TP
    true_negatives, false_positives, false_negatives, true_positives = conf_matrix.ravel()
    new_result = pd.DataFrame({'note': note, 'param': Mparam, 'accuracy': score, 'precision': precision,
                               'recall': recall, 'f1_score': f1, 'true_negatives': true_negatives,
                               'false_positives': false_positives, 'false_negatives': false_negatives,
                               'true_positives': true_positives}, index=[0])
    print(conf_matrix)
    return pd.concat([results, new_result], axis=0)

# Empty table that collects one row per evaluated model
results = pd.DataFrame(columns=['note', 'param', 'accuracy', 'precision', 'recall', 'f1_score',
                                'true_negatives', 'false_positives', 'false_negatives', 'true_positives'])
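
The layout of scikit-learn's confusion matrix is easy to mix up, so here is a short sketch on made-up labels (the arrays are assumptions for illustration). Rows are the true labels and columns are the predicted labels, so with labels `[0, 1]` the top-left cell counts true negatives and the bottom-right cell counts true positives.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: four negatives and three positives
y_true = np.array([0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 0, 0])

# Rows are true labels, columns are predicted labels, ordered [0, 1]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```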

Step 3. Apply the RandomForestClassifier and LogisticRegression model with default parameters to your data 
What is the accuracy for your models?

In [1485]:
Y = bank_data_new['y'].copy()
X = bank_data_new.drop(columns= 'y').copy()

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)
model = LogisticRegression()
model.fit(X_train, Y_train)
score = model.score(X_test, Y_test)
print("model_accuracy_score: ",score)
pred = model.predict(X_test)

results = evaluate_model('Before Feature Selection_Logistic','Default', model, X_test, Y_test, results)



model_accuracy_score:  0.9111391667475964
[[8906  233]
 [ 682  476]]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
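
The warning above suggests scaling the data or raising `max_iter`. A minimal sketch of both, assuming synthetic data from `make_classification` in place of the bank data, could look like this. The pipeline and the value `max_iter=1000` are illustrative choices, not the lab's settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the bank data, with one feature on a much larger scale
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X[:, 0] *= 1000.0

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale the features first, then fit logistic regression with a higher iteration limit
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_train, y_train)
accuracy = scaled_logreg.score(X_test, y_test)
print("accuracy:", accuracy)
```

Putting the scaler inside the pipeline means it is fitted on the training split only, which avoids leaking test-set statistics into the model.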


In [1486]:
results

Unnamed: 0,note,param,accuracy,precision,recall,f1_score,true_negatives,false_positives,false_negatives,true_positives
0,Before Feature Selection_Logistic,Default,0.911139,0.671368,0.411054,0.509909,8906,233,682,476


In [1487]:
Y = bank_data_new['y'].copy()
X = bank_data_new.drop(columns= 'y').copy()

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)
model1 = RandomForestClassifier()
model1.fit(X_train, Y_train)
score1 = model1.score(X_test, Y_test)
print("model1_accuracy_score: ",score1)

pred = model1.predict(X_test)

results = evaluate_model('Before Feature Selection_RandomForest','Default', model1, X_test, Y_test, results)

model1_accuracy_score:  0.9138584053607847
[[8890  249]
 [ 638  520]]


In [1488]:
results

Unnamed: 0,note,param,accuracy,precision,recall,f1_score,true_negatives,false_positives,false_negatives,true_positives
0,Before Feature Selection_Logistic,Default,0.911139,0.671368,0.411054,0.509909,8906,233,682,476
0,Before Feature Selection_RandomForest,Default,0.913858,0.676203,0.44905,0.539699,8890,249,638,520



Step 4. Select features using the SelectFromModel method. Explain how you define the optimal number of features.


In [1489]:
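# SelectFromModel refits a clone of the logistic regression and keeps the features whose
# absolute coefficient is at least the threshold (threshold=None uses the mean here)
feature_selection = SelectFromModel(model, threshold=None)
feature_selection.fit(X_train, Y_train)
coefficients = pd.DataFrame(feature_selection.estimator_.coef_).T

status = feature_selection.get_support()
print("Selection status: ", status) 
 
features = array(X.columns)
print("All features:")
print(features) 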

Selection status:  [False False False False False False  True  True False  True False False
 False  True False False False False False False False False False  True
  True False False False False  True  True False False False  True False
 False False  True False  True  True False False False False False False
 False False  True  True  True False False False False False False False
 False]
All features:
['age' 'duration' 'campaign' 'pdays' 'previous' 'cons.price.idx'
 'cons.conf.idx' 'euribor3m' 'job_admin.' 'job_blue-collar'
 'job_entrepreneur' 'job_housemaid' 'job_management' 'job_retired'
 'job_self-employed' 'job_services' 'job_student' 'job_technician'
 'job_unemployed' 'job_unknown' 'loan_no' 'loan_unknown' 'loan_yes'
 'contact_cellular' 'contact_telephone' 'housing_no' 'housing_unknown'
 'housing_yes' 'marital_divorced' 'marital_married' 'marital_single'
 'marital_unknown' 'education_basic.4y' 'education_basic.6y'
 'education_basic.9y' 'education_high.school' 'education_illiterat

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [1490]:
feature_selection.threshold_

0.03953717188043171
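
The threshold printed above is what decides how many features SelectFromModel keeps. As a rough sketch, assuming synthetic data and a default (L2-penalised) logistic regression, `threshold=None` resolves to the mean absolute coefficient, and every feature at or above it is selected:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the bank data
X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)

selector = SelectFromModel(LogisticRegression(max_iter=1000), threshold=None)
selector.fit(X, y)

# For a binary logistic regression the importance of a feature is |coefficient|
importances = np.abs(selector.estimator_.coef_).ravel()
print("threshold:", selector.threshold_)
print("mean |coef|:", importances.mean())
print("selected:", selector.get_support().sum(), "of", X.shape[1])
```

Passing an explicit value such as `threshold="median"` or setting `max_features` are other ways to control the number of selected features.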

In [1491]:
feature_selection.get_feature_names_out()

array(['cons.conf.idx', 'euribor3m', 'job_blue-collar', 'job_retired',
       'contact_cellular', 'contact_telephone', 'marital_married',
       'marital_single', 'education_basic.9y',
       'education_university.degree', 'default_no', 'default_unknown',
       'month_jun', 'month_mar', 'month_may'], dtype=object)


Step 5. Apply the RandomForestClassifier and LogisticRegression model with default parameters to your data (only selected features). What is the accuracy for your models?

In [1492]:
data_final = bank_data_new[['cons.conf.idx', 'euribor3m', 'job_blue-collar', 'job_retired',
       'contact_cellular', 'contact_telephone', 'marital_married',
       'marital_single', 'education_basic.9y',
       'education_university.degree', 'default_no', 'default_unknown',
       'month_jun', 'month_mar', 'month_may', 'y']]

In [1493]:
data_final

Unnamed: 0,cons.conf.idx,euribor3m,job_blue-collar,job_retired,contact_cellular,contact_telephone,marital_married,marital_single,education_basic.9y,education_university.degree,default_no,default_unknown,month_jun,month_mar,month_may,y
0,-36.4,4.857,False,False,False,True,True,False,False,False,True,False,False,False,True,0
1,-36.4,4.857,False,False,False,True,True,False,False,False,False,True,False,False,True,0
2,-36.4,4.857,False,False,False,True,True,False,False,False,True,False,False,False,True,0
3,-36.4,4.857,False,False,False,True,True,False,False,False,True,False,False,False,True,0
4,-36.4,4.857,False,False,False,True,True,False,False,False,True,False,False,False,True,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,-50.8,1.028,False,True,True,False,True,False,False,False,True,False,False,False,False,1
41184,-50.8,1.028,True,False,True,False,True,False,False,False,True,False,False,False,False,0
41185,-50.8,1.028,False,True,True,False,True,False,False,True,True,False,False,False,False,0
41186,-50.8,1.028,False,False,True,False,True,False,False,False,True,False,False,False,False,1


In [1503]:
Y = data_final['y'].copy()
X = data_final.drop(columns= 'y').copy()

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)
model2 = LogisticRegression()
model2.fit(X_train, Y_train)
score = model2.score(X_test, Y_test)
print("model2_accuracy_score: ",score)

pred = model2.predict(X_test)

results = evaluate_model('After Feature Selection_Logistic','Default', model2, X_test, Y_test, results)

model2_accuracy_score:  0.8884141011945227
[[9084   55]
 [1094   64]]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [1504]:
Y = data_final['y'].copy()
X = data_final.drop(columns= 'y').copy()

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)
model3 = RandomForestClassifier()
model3.fit(X_train, Y_train)
score1 = model3.score(X_test, Y_test)
print("model3_accuracy_score: ",score1)

pred = model3.predict(X_test)

results = evaluate_model('After Feature Selection_RandomForest','Default', model3, X_test, Y_test, results)

model3_accuracy_score:  0.8897737205011168
[[8864  275]
 [ 860  298]]


In [1505]:
results

Unnamed: 0,note,param,accuracy,precision,recall,f1_score,true_negatives,false_positives,false_negatives,true_positives
0,Before Feature Selection_Logistic,Default,0.911139,0.671368,0.411054,0.509909,8906,233,682,476
0,Before Feature Selection_RandomForest,Default,0.913858,0.676203,0.44905,0.539699,8890,249,638,520
0,After Feature Selection_Logistic,Default,0.888414,0.537815,0.055268,0.100235,9084,55,1094,64
0,After Feature Selection_RandomForest,Default,0.889774,0.52007,0.25734,0.34431,8864,275,860,298



Step 6. Select features using the RFE and RFECV methods. Explain how you define the optimal number of features in each case.

In [1496]:
# RFE feature selection:
Y = bank_data_new['y'].copy()
X = bank_data_new.drop(columns = 'y').copy()

In [1497]:
from sklearn.feature_selection import RFE
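# LinearRegression is used here only to rank features by coefficient size; 20 features is a manual choice
model = LinearRegression()
rfe = RFE(model, n_features_to_select=20, verbose=False)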
rfe.fit(X, Y)


In [1498]:
status = rfe.get_support()
print("Selection status: ", status) 
rfe.get_feature_names_out()

Selection status:  [False False False False False False False False False False False False
 False False False False False False False False  True False  True  True
  True False  True False False False False False False False False False
 False False False False False  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True False False False False
 False]


array(['loan_no', 'loan_yes', 'contact_cellular', 'contact_telephone',
       'housing_unknown', 'default_unknown', 'default_yes',
       'poutcome_failure', 'poutcome_nonexistent', 'poutcome_success',
       'month_apr', 'month_aug', 'month_dec', 'month_jul', 'month_jun',
       'month_mar', 'month_may', 'month_nov', 'month_oct', 'month_sep'],
      dtype=object)
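
With RFE the number of features is fixed up front through `n_features_to_select`, and the weakest feature is removed one step at a time. A small sketch on synthetic data follows. It swaps in LogisticRegression as the ranking estimator, which is an assumption (the lab uses LinearRegression), and the choice of 4 features is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the bank data
X, y = make_classification(n_samples=300, n_features=10, n_informative=3, random_state=0)

# Remove one feature per step until 4 remain
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4, step=1)
rfe.fit(X, y)

# ranking_ is 1 for kept features; larger values were eliminated earlier
print("kept:", rfe.support_.sum())
print("ranking:", rfe.ranking_)
```

The `ranking_` attribute is useful for judging whether the chosen count is reasonable, because it shows the order in which features were discarded.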

In [1499]:
data_final2 = bank_data_new[['loan_no', 'loan_yes', 'contact_cellular', 'contact_telephone',
       'housing_unknown', 'default_unknown', 'default_yes',
       'poutcome_failure', 'poutcome_nonexistent', 'poutcome_success',
       'month_apr', 'month_aug', 'month_dec', 'month_jul', 'month_jun',
       'month_mar', 'month_may', 'month_nov', 'month_oct', 'month_sep','y']]

In [1500]:
# RFECV feature selection:
Y = bank_data_new['y'].copy()
X = bank_data_new.drop(columns = 'y').copy()

In [1501]:
from sklearn.feature_selection import RFECV
# RFECV picks the number of features with the best 5-fold cross-validated score
model = LinearRegression()
rfecv = RFECV(model, step=1, cv=5)
rfecv = rfecv.fit(X, Y)

In [1502]:
status = rfecv.get_support()
print("Selection status: ", status) 
rfecv.get_feature_names_out()

Selection status:  [False  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True]


array(['duration', 'campaign', 'pdays', 'previous', 'cons.price.idx',
       'cons.conf.idx', 'euribor3m', 'job_admin.', 'job_blue-collar',
       'job_entrepreneur', 'job_housemaid', 'job_management',
       'job_retired', 'job_self-employed', 'job_services', 'job_student',
       'job_technician', 'job_unemployed', 'job_unknown', 'loan_no',
       'loan_unknown', 'loan_yes', 'contact_cellular',
       'contact_telephone', 'housing_no', 'housing_unknown',
       'housing_yes', 'marital_divorced', 'marital_married',
       'marital_single', 'marital_unknown', 'education_basic.4y',
       'education_basic.6y', 'education_basic.9y',
       'education_high.school', 'education_illiterate',
       'education_professional.course', 'education_university.degree',
       'education_unknown', 'default_no', 'default_unknown',
       'default_yes', 'poutcome_failure', 'poutcome_nonexistent',
       'poutcome_success', 'month_apr', 'month_aug', 'month_dec',
       'month_jul', 'month_jun', 'month_m

In [1506]:
data_final3 = bank_data_new[['duration', 'campaign', 'pdays', 'previous', 'cons.price.idx',
       'cons.conf.idx', 'euribor3m', 'job_admin.', 'job_blue-collar',
       'job_entrepreneur', 'job_housemaid', 'job_management',
       'job_retired', 'job_self-employed', 'job_services', 'job_student',
       'job_technician', 'job_unemployed', 'job_unknown', 'loan_no',
       'loan_unknown', 'loan_yes', 'contact_cellular',
       'contact_telephone', 'housing_no', 'housing_unknown',
       'housing_yes', 'marital_divorced', 'marital_married',
       'marital_single', 'marital_unknown', 'education_basic.4y',
       'education_basic.6y', 'education_basic.9y',
       'education_high.school', 'education_illiterate',
       'education_professional.course', 'education_university.degree',
       'education_unknown', 'default_no', 'default_unknown',
       'default_yes', 'poutcome_failure', 'poutcome_nonexistent',
       'poutcome_success', 'month_apr', 'month_aug', 'month_dec',
       'month_jul', 'month_jun', 'month_mar', 'month_may', 'month_nov',
       'month_oct', 'month_sep', 'day_of_week_fri', 'day_of_week_mon',
       'day_of_week_thu', 'day_of_week_tue', 'day_of_week_wed','y']]

Step 7. Apply the RandomForestClassifier and LogisticRegression model with default parameters to your data (you will have 4 models, taking into account two sets of features that you got). What is the accuracy for your models?


In [1507]:
Y = data_final2['y'].copy()
X = data_final2.drop(columns= 'y').copy()

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)
model4 = LogisticRegression()
model4.fit(X_train, Y_train)
score = model4.score(X_test, Y_test)
print("model4_accuracy_score: ",score)

pred = model4.predict(X_test)

results = evaluate_model('After RFE_Logistic','Default', model4, X_test, Y_test, results)

model4_accuracy_score:  0.8955035447217636
[[9023  116]
 [ 960  198]]


In [1508]:
Y = data_final2['y'].copy()
X = data_final2.drop(columns= 'y').copy()

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)
model5 = RandomForestClassifier()
model5.fit(X_train, Y_train)
score = model5.score(X_test, Y_test)
print("model5_accuracy_score: ",score)

pred = model5.predict(X_test)

results = evaluate_model('After RFE_RandomForest','Default', model5, X_test, Y_test, results)

model5_accuracy_score:  0.8969602796931145
[[9004  135]
 [ 926  232]]


In [1509]:
Y = data_final3['y'].copy()
X = data_final3.drop(columns= 'y').copy()

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)
model6 = LogisticRegression()
model6.fit(X_train, Y_train)
score = model6.score(X_test, Y_test)
print("model6_accuracy_score: ",score)

pred = model6.predict(X_test)

results = evaluate_model('After RFECV_Logistic','Default', model6, X_test, Y_test, results)

model6_accuracy_score:  0.9110420510828396
[[8902  237]
 [ 679  479]]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [1510]:
Y = data_final3['y'].copy()
X = data_final3.drop(columns= 'y').copy()

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)
model7 = RandomForestClassifier()
model7.fit(X_train, Y_train)
score = model7.score(X_test, Y_test)
print("model7_accuracy_score: ",score)

pred = model7.predict(X_test)

results = evaluate_model('After RFECV_RandomForest','Default', model7, X_test, Y_test, results)

model7_accuracy_score:  0.9142468680198116
[[8879  260]
 [ 623  535]]


In [1511]:
results

Unnamed: 0,note,param,accuracy,precision,recall,f1_score,true_negatives,false_positives,false_negatives,true_positives
0,Before Feature Selection_Logistic,Default,0.911139,0.671368,0.411054,0.509909,8906,233,682,476
0,Before Feature Selection_RandomForest,Default,0.913858,0.676203,0.44905,0.539699,8890,249,638,520
0,After Feature Selection_Logistic,Default,0.888414,0.537815,0.055268,0.100235,9084,55,1094,64
0,After Feature Selection_RandomForest,Default,0.889774,0.52007,0.25734,0.34431,8864,275,860,298
0,After RFE_Logistic,Default,0.895504,0.630573,0.170984,0.269022,9023,116,960,198
0,After RFE_RandomForest,Default,0.89696,0.632153,0.200345,0.304262,9004,135,926,232
0,After RFECV_Logistic,Default,0.911042,0.668994,0.413644,0.511206,8902,237,679,479
0,After RFECV_RandomForest,Default,0.914247,0.672956,0.462003,0.547875,8879,260,623,535
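
When reading the accuracies in this table, it helps to compare them with a trivial baseline, because the target is imbalanced. The sketch below uses the class counts printed earlier by `bank_data['y'].value_counts()` and computes the accuracy of always predicting the majority class.

```python
# Class counts taken from bank_data['y'].value_counts() earlier in the notebook
n_no, n_yes = 36548, 4640

# Accuracy of a model that always predicts the majority class 'no'
baseline_accuracy = n_no / (n_no + n_yes)
print(round(baseline_accuracy, 4))
```

Since this baseline is already close to 0.89, precision, recall and F1 for the 'yes' class say more about the differences between the models than accuracy alone.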











Step 8. Please draw a conclusion about the usefulness of feature selection.

In [None]:
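# Feature selection matters for model quality, but on this dataset it did not raise accuracy by itself.
# SelectFromModel (15 features) and RFE (20 features) lowered accuracy from about 0.91 to about 0.89-0.90
# and reduced recall considerably.
# RFECV with RandomForestClassifier gave the highest accuracy (0.914), but it removed only one feature ('age'),
# so the result is almost identical to the model trained on all features. It also took the longest to run,
# so on big datasets it will be computationally expensive.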