# Python Frameworks for Machine Learning

In this tutorial, you will learn how to work with two of the most fundamental Python libraries for Machine Learning: NumPy and pandas.

## NumPy (https://github.com/numpy/numpy)

NumPy is a Python library for array operations. It is well suited to Machine Learning tasks, where speed and memory use matter: operations on NumPy arrays can be up to 50x faster than the same operations on Python lists.

Let's begin by installing and importing the library. (You just have to run the next cells)

In [None]:
pip install numpy

Note: Usually, you would create a Python environment and install all the required packages there. However, since this is a Jupyter notebook tutorial, we can install them directly from a cell.

In [32]:
import numpy as np

Now, let's do some exercises that make use of NumPy's capabilities.

#### Exercise 1 - Create a NumPy array

It is quite simple to create a NumPy array. You'll see an example and then reproduce the method using a different sequence of numbers.

In [None]:
example_array = np.array([1, 2, 3, 4, 5])
print("Here's an example: " + str(example_array) + "\n")
print("Now it's your turn.\n")

your_array = #insert code here

print(your_array)

As you can see, we used a list to create a NumPy array, but we can also use tuples.

In [None]:
array_from_tuple = np.array((1, 2, 3, 4, 5))

print(array_from_tuple)

#### Exercise 2 - Multi-dimensional arrays

So far, we've introduced one-dimensional arrays (which are basically vectors), but NumPy lets you create arrays with n dimensions. See an example of a two-dimensional array (a matrix):

In [None]:
# Three rows and three columns: a two-dimensional array
array_2d = np.array([[2, 3, 4], [5, 6, 7], [2, 3, 7]])

print(array_2d)

Now, create an array with 7 dimensions.

In [None]:
array_7d = #insert code here

print(array_7d)

You can check the shape of your array using the shape attribute:

In [None]:
# shape is an attribute, so it takes no parentheses
print(array_7d.shape)
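
As a minimal sketch of how dimensions and shape relate (the array names below are illustrative and not part of the exercises), `ndim` reports the number of dimensions and `shape` the size along each of them:

```python
import numpy as np

# Illustrative arrays (hypothetical names, not used elsewhere in the tutorial)
vector = np.array([1, 2, 3])                            # 1 dimension
matrix = np.array([[1, 2, 3], [4, 5, 6]])               # 2 dimensions: 2 rows, 3 columns
cube = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])   # 3 dimensions

for name, arr in [("vector", vector), ("matrix", matrix), ("cube", cube)]:
    # shape is an attribute (no parentheses), and ndim equals len(shape)
    print(name, arr.ndim, arr.shape)
```

Each extra level of nested brackets adds one dimension.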

#### Exercise 4 - Indexing

In NumPy, you access an element within an array using square brackets - [] - the same way you do with Python lists. See the following example.

In [None]:
array = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

print(array[0])
print(array[1])
print(array[7])


Insert the code necessary to access and print the 9th element of the previous array.

In [None]:
ninth = #insert code here

print(ninth)

What is the length of the array? Insert the code needed to answer this question.

In [None]:
arr_len = #insert code here

print(arr_len)

Since the length of the array is 9, the 9th element of the array is the last one. You can access the last element of an array in a different manner:

In [None]:
last_element = array[-1]

print(last_element)

For multi-dimensional arrays, you need to use multiple indexes. Consider the following bi-dimensional array.

In [None]:
bidim_array = np.array([[1, 2, 3], [4, 5, 6]])

print(bidim_array)

You want to access the element with value "4" in this array, which is located in the second row, first column. How can you do that?

In [None]:
required_element = #insert code here

print(required_element)

#### Exercise 5 - Array Slicing

Now that you know all about indexing, let's talk slicing. You can slice sub-parts of arrays using the following notation:

In [None]:
array = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

sliced_array = array[1:6]

print(sliced_array)

Your turn. Access and print the 3 middle elements of the array ("4", "5", and "6") using a slicing technique.

In [None]:
# "slice" is a Python built-in, so we use a different variable name
middle_slice = #insert code here

print(middle_slice)

You can also use negative indexes. See the example and try it yourself.

In [None]:
print(array[-6:-2])


In [None]:

your_slice = #insert code here

print(your_slice)

Now let's get a sequence of elements through steps.

In [None]:
step_slice = array[1:9:2]

print(step_slice)

Your move. Try different step and limit combinations (including negatives).

In [None]:
my_step_slice = #insert code here

print(my_step_slice)
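
The slicing notation is `start:stop:step`, and any of the three parts can be omitted. A short sketch on a made-up array (the names are illustrative, separate from the exercise variables):

```python
import numpy as np

# Illustrative array, separate from the exercise variables
letters = np.array(["a", "b", "c", "d", "e", "f", "g"])

every_second = letters[::2]    # from start to end, taking every second element
reversed_all = letters[::-1]   # a negative step walks through the array backwards
last_three = letters[-3:]      # a negative start counts from the end

print(every_second)
print(reversed_all)
print(last_three)
```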

#### Exercise 6 - Data Types

We've only worked with integers so far, but ndarrays are compatible with other types of data. Try to create an array out of a list of strings.

In [None]:
str_array = #insert code here

print(str_array)

Did it work? Now try a list of floats.

In [None]:
flt_array = #insert code here

print(flt_array)

You can also convert arrays to a different data type (as long as the conversion is possible). Check the following conversion of an integer array to a boolean.

In [None]:
int_array = np.array([0, 1, 0, 0])

bool_array = int_array.astype('bool')

print(bool_array)

Now convert the float array you created into an integer array.

In [None]:
my_int_array = #insert code here

print(my_int_array)
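
When converting types, it helps to check what NumPy inferred and what the cast actually does. The sketch below uses made-up values (the variable names are assumptions for illustration) to show that casting floats to integers drops the fractional part rather than rounding:

```python
import numpy as np

prices = np.array([1.9, 2.4, -3.7])      # hypothetical values

print(prices.dtype)                      # the element type NumPy inferred

truncated = prices.astype(int)           # casting to int drops the fractional part
rounded = np.round(prices).astype(int)   # round first to get the nearest integer

print(truncated)
print(rounded)
```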

#### Exercise 7 - Join and Split Arrays

If we wish to join the contents of two or more arrays into a single array, we can use the np.concatenate function.

In [None]:
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

arr = np.concatenate((arr1, arr2))
print(arr)

Join the contents of the following 4 arrays.

In [None]:
arr1 = np.array([5, 23, 4, 99])
arr2 = np.array([98, 3, 15])
arr3 = np.array([70, 20, 12])
arr4 = np.array([1])

final_arr = # insert code here
print(final_arr)

Now let's split the array you created.

In [None]:
split_results = np.array_split(final_arr, 2)

print(split_results)

Your turn. Split final_arr into 5 arrays.

In [None]:
split_arrays = # insert code here

print(split_arrays)
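
np.array_split also works when the array cannot be divided into equal parts. A minimal sketch with a made-up array (names are illustrative):

```python
import numpy as np

values = np.arange(7)               # hypothetical array: 0, 1, ..., 6

# 7 elements cannot be divided evenly into 3 parts
parts = np.array_split(values, 3)

for part in parts:
    print(part)                     # the earlier chunks receive the extra elements
```

By contrast, np.split raises a ValueError when an equal division is not possible, which is why this tutorial uses np.array_split.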

#### Exercise 8 - Search Arrays

In data analysis tasks, it is often useful to search an array for the elements that satisfy a certain condition. For this purpose, NumPy offers the function np.where, which does exactly that. See the next example.

In [None]:
arr = np.array([4, 93, 83, 94, 12])

print(np.where(arr==12))

As you can see, np.where correctly located the value "12" at index 4, the fifth position of the array.
You can use any type of condition for your search. For certain tasks, it might be useful to locate all elements satisfying a certain condition. For instance, given an array containing the ages of all the patients in a clinical facility, we might want to identify only the adults (18 years old or more). How would you do that?

In [None]:
ages = np.array([8, 19, 18, 29, 88, 82, 3, 45, 51, 54, 74, 54, 66, 23, 3, 7, 92, 65, 64])

adults_idx = #insert code here

print(adults_idx)
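
A minimal sketch of how np.where reports its matches, using a made-up temperature array (the names are illustrative): it returns a tuple with one index array per dimension, and those indexes can be used to retrieve the matching values.

```python
import numpy as np

temps = np.array([36.5, 38.2, 37.0, 39.1])   # hypothetical body temperatures

fever_idx = np.where(temps > 38.0)   # tuple holding one index array per dimension

print(fever_idx[0])                  # positions of the matching elements
print(temps[fever_idx])              # the matching values themselves
```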

Now keep the previous example in mind. What if you wanted to sort the patients by age? NumPy has a function for that too: np.sort. How do you think you can use it? Complete the following cell to obtain a sorted ages array.

In [None]:
ordered_ages = #insert code here

print(ordered_ages)

Sometimes, we do not want an ordered representation, but the ordered indexes instead, so that we can access them in the original array. For that, we use np.argsort. Give it a try!

In [None]:
sorted_indexes = np.argsort(ages)

print(sorted_indexes)
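
One reason to prefer sorted indexes over sorted values is that the same order can be applied to a second, parallel array. A sketch under that assumption, with made-up data and names:

```python
import numpy as np

heights = np.array([172, 158, 190, 165])       # hypothetical measurements
names = np.array(["Ana", "Ben", "Cy", "Dee"])  # one name per measurement

order = np.argsort(heights)   # indexes that would sort heights in ascending order

print(order)
print(heights[order])         # same result as np.sort(heights)
print(names[order])           # the names, reordered consistently with the heights
```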

What do you think happens when you apply np.sort to a string array? Check your guess in the next cell.

In [None]:
string_arr = np.array(["what", "is", "going", "on", "here"])

sorted_str_arr = #insert code here

print(sorted_str_arr)

#### Exercise 9 - Filtering

NumPy is also a powerful tool to filter information. Check the next example.

In [None]:
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# We want only the odd numbers in the array
odd_numbers = []
for number in arr:
    if number % 2 != 0:
        odd_numbers.append(True)
    else:
        odd_numbers.append(False)

filtered_array = arr[odd_numbers]

print(filtered_array)
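
The loop above builds the boolean list one element at a time. As an alternative sketch (not a replacement for the example above), a comparison applied directly to the array produces the same boolean mask in a single step:

```python
import numpy as np

numbers = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

odd_mask = numbers % 2 != 0    # element-wise comparison yields a boolean array
odd_only = numbers[odd_mask]   # keep the positions where the mask is True

print(odd_mask)
print(odd_only)
```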

You can also use the indexes of the elements to filter arrays. Recall Exercise 8, where you searched for the adult patients in an array of patients' ages. How can you obtain an array containing only the ages of adult patients using filtering techniques?

In [None]:
adults_idxs = #insert code here

filtered_ages = #insert code here

print(filtered_ages)

You have completed your introductory class on NumPy! Congratulations!

## Pandas (https://pandas.pydata.org/)

Pandas is one of the most widely used Python libraries for data analysis. In this tutorial, you'll learn the basic skills needed to work with pandas in the context of Machine Learning.

Start by installing and importing the library.

In [None]:
pip install pandas

In [2]:
import pandas as pd

Let's start by understanding the two major data representations in pandas: DataFrames and Series.

### Series

A pandas Series is a one-dimensional sequence of values with an index, much like a Python list or a NumPy array. Check the next cell:

In [39]:
sample_series = pd.Series([1, 2, 3, 4, 5])

print(sample_series)

0    1
1    2
2    3
3    4
4    5
dtype: int64


A Series is basically a DataFrame column. Let's talk about DataFrames next!

### DataFrame

A DataFrame is a table containing an array (matrix) of individual entries. Each entry has a row and a column associated. Let's see an example with made-up clinical data.

In [38]:
clinical_df = pd.DataFrame({
    'Age': np.array([18, 61, 34]),
    'Diseases': ['None', 'Diabetes', 'Hypertension'],
    'Smoking History': ['Non-smoker', 'Past smoker', 'Current smoker'],
})

print(clinical_df)

   Age      Diseases Smoking History
0   18          None      Non-smoker
1   61      Diabetes     Past smoker
2   34  Hypertension  Current smoker
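
To make the link between the two structures concrete, here is a small sketch on a made-up table (the names are illustrative): selecting a single column of a DataFrame returns a Series, and each column keeps its own data type.

```python
import pandas as pd

# Hypothetical table, unrelated to the tutorial data
pets_df = pd.DataFrame({"name": ["Rex", "Mia"], "age": [3, 5]})

age_column = pets_df["age"]            # selecting one column returns a Series

print(type(age_column).__name__)
print(age_column.name, list(age_column))
print(pets_df.dtypes)                  # each column keeps its own dtype
```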


As you can see, we can represent and organize different types of data with DataFrames. Create your own DataFrame with movie-related data (be creative!).

In [None]:
movie_df = #insert your code here

print(movie_df)

#### Exercise 1 - Import Data

Although I'm sure you'll agree it's pretty easy to create simple DataFrames manually, data scientists do not usually work with small amounts of data such as the one you just created. Instead, they need to import data from large data files. Pandas can read the most popular data formats, such as CSV, plain text, and JSON files, as well as SQL databases, among others.

Let's begin by importing tabular data from a ".csv" file. The data refers to the top hits on Spotify from 2000 to 2019 (https://www.kaggle.com/datasets/paradisejoy/top-hits-spotify-from-20002019?select=songs_normalize.csv). To load it, run the following cell.

In [3]:
songs_data = pd.read_csv("songs_normalize.csv")

You successfully loaded the dataset from a ".csv" file. Check its size with the attribute "shape".

In [None]:
print("DataFrame shape:")
print(songs_data.shape)

print("This DataFrame has " + str(songs_data.shape[0]) + " rows and " + str(songs_data.shape[1]) + " columns.")
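
If you do not have the file at hand, the same loading step can be tried on an in-memory string. This is only a sketch: the CSV content and the names below are made up, and `io.StringIO` stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = "title,year,rating\nAlpha,2001,7.5\nBeta,2010,8.1\n"

# read_csv accepts file paths as well as file-like objects
films = pd.read_csv(io.StringIO(csv_text))

print(films.shape)     # (rows, columns)
print(films.dtypes)    # pandas infers a type for each column
```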

You can check the contents of the DataFrame by looking at its first few rows. You do that using the method "head".

In [54]:
print(songs_data.head())

           artist                    song  duration_ms  explicit  year  \
0  Britney Spears  Oops!...I Did It Again       211160     False  2000   
1       blink-182    All The Small Things       167066     False  1999   
2      Faith Hill                 Breathe       250546     False  1999   
3        Bon Jovi            It's My Life       224493     False  2000   
4          *NSYNC             Bye Bye Bye       200560     False  2000   

   popularity  danceability  energy  key  loudness  mode  speechiness  \
0          77         0.751   0.834    1    -5.444     0       0.0437   
1          79         0.434   0.897    0    -4.918     1       0.0488   
2          66         0.529   0.496    7    -9.007     1       0.0290   
3          78         0.551   0.913    0    -4.063     0       0.0466   
4          65         0.614   0.928    8    -4.806     0       0.0516   

   acousticness  instrumentalness  liveness  valence    tempo         genre  
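0        0.3000          0.000018    0.3550    0.894   95.053           pop  
1        0.0103          0.000000    0.6120    0.684  148.726     rock, pop  
2        0.1730          0.000000    0.2510    0.278  136.859  pop, country  
3        0.0263          0.000013    0.3470    0.544  119.992   rock, metal  
4        0.0408          0.001040    0.0845    0.879  172.656           pop  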

#### Exercise 2 - Indexing

In order to work with data, you need to know how to access data. Pandas offers you a fluid way to do that through DataFrames.

You can access a column in many ways:

In [9]:
print("Method 1:")
songs_data.song


Method 1:


0                       Oops!...I Did It Again
1                         All The Small Things
2                                      Breathe
3                                 It's My Life
4                                  Bye Bye Bye
                         ...                  
1995                                    Sucker
1996                              Cruel Summer
1997                                The Git Up
1998    Dancing With A Stranger (with Normani)
1999                                   Circles
Name: song, Length: 2000, dtype: object

In [8]:
print("Method2")

songs_data['song']

Method2


0                       Oops!...I Did It Again
1                         All The Small Things
2                                      Breathe
3                                 It's My Life
4                                  Bye Bye Bye
                         ...                  
1995                                    Sucker
1996                              Cruel Summer
1997                                The Git Up
1998    Dancing With A Stranger (with Normani)
1999                                   Circles
Name: song, Length: 2000, dtype: object

In [7]:
print("Method3")

songs_data.loc[:,'song']

Method3


0                       Oops!...I Did It Again
1                         All The Small Things
2                                      Breathe
3                                 It's My Life
4                                  Bye Bye Bye
                         ...                  
1995                                    Sucker
1996                              Cruel Summer
1997                                The Git Up
1998    Dancing With A Stranger (with Normani)
1999                                   Circles
Name: song, Length: 2000, dtype: object

Although they all work, you should take special interest in the third method, as it is more flexible than the other ones in terms of data accessibility.

Access the "popularity" column using the third method.

In [None]:
popularity_col = #insert code here

print(popularity_col)

.loc is a label-based selection method. It has a position-based counterpart: .iloc. Try it! Select and print the second column (the song column) of the songs DataFrame using .iloc.

In [None]:
second_col = #insert code here

print(second_col)

These methods allow you to select both rows and columns at the same time, and you can even select ranges of both (similar to what you did when slicing NumPy arrays). Check this example:

In [73]:
example = songs_data.iloc[1:3, 0:4]

print(example)

       artist                  song  duration_ms  explicit
1   blink-182  All The Small Things       167066     False
2  Faith Hill               Breathe       250546     False


The first argument refers to the rows, while the second indicates the columns. Now it's your turn: select the contents of the last three columns for the songs from 6 to 10.

In [None]:
selection = #insert your code here
print(selection)
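
One difference between the two methods is easy to miss: with .iloc the end of a range is excluded, while with .loc it is included. A small sketch on a made-up DataFrame (names are illustrative):

```python
import pandas as pd

# Hypothetical DataFrame with the default integer index
toy = pd.DataFrame({"a": [10, 20, 30, 40], "b": [1, 2, 3, 4], "c": list("wxyz")})

by_position = toy.iloc[1:3, 0:2]   # positions: the end is excluded, so rows 1 and 2
by_label = toy.loc[1:3, "a":"b"]   # labels: the end is included, so rows 1, 2 and 3

print(by_position)
print(by_label)
```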

The .loc method also allows you to perform conditional selection.

In [78]:
songs_data.loc[songs_data.artist == 'Britney Spears']

Unnamed: 0,artist,song,duration_ms,explicit,year,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,genre
0,Britney Spears,Oops!...I Did It Again,211160,False,2000,77,0.751,0.834,1,-5.444,0,0.0437,0.3,1.8e-05,0.355,0.894,95.053,pop
34,Britney Spears,Born to Make You Happy,243533,False,1999,58,0.633,0.922,11,-4.842,0,0.0454,0.116,0.000465,0.071,0.686,84.11,pop
98,Britney Spears,Lucky,206226,False,2000,65,0.765,0.791,8,-5.707,1,0.0317,0.262,0.000154,0.0669,0.966,95.026,pop
111,Britney Spears,I'm a Slave 4 U,203600,False,2001,69,0.847,0.843,5,-3.579,0,0.106,0.415,0.000134,0.107,0.963,110.027,pop
223,Britney Spears,Overprotected - Radio Edit,198600,False,2001,61,0.682,0.894,0,-1.73,0,0.0727,0.0381,0.0,0.416,0.845,95.992,pop
278,Britney Spears,"I'm Not a Girl, Not Yet a Woman",231066,False,2001,58,0.534,0.543,3,-6.857,1,0.0245,0.579,0.0,0.112,0.418,78.996,pop
326,Britney Spears,Me Against the Music (feat. Madonna) - LP Vers...,223773,False,2003,59,0.804,0.836,6,-6.635,0,0.089,0.32,0.0,0.213,0.85,120.046,pop
402,Britney Spears,Toxic,198800,False,2003,81,0.774,0.838,5,-3.914,0,0.114,0.0249,0.025,0.242,0.924,143.04,pop
429,Britney Spears,My Prerogative,213893,False,2004,53,0.749,0.938,10,-4.423,0,0.118,0.0127,2e-06,0.103,0.619,111.014,pop
459,Britney Spears,Everytime,230306,False,2003,63,0.398,0.284,3,-12.852,1,0.0337,0.966,8.6e-05,0.116,0.114,109.599,pop


Try it yourself. Select all data corresponding to songs with energy above 0.98.

In [10]:
selected_data = #insert code here

selected_data

Unnamed: 0,artist,song,duration_ms,explicit,year,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,genre
74,Madison Avenue,Don't Call Me Baby,228140,False,1999,56,0.808,0.982,3,-6.588,0,0.0311,0.0585,0.00689,0.35,0.961,124.999,Dance/Electronic
117,Lasgo,Something,220973,False,2001,65,0.643,0.981,7,-6.644,0,0.0439,0.0271,8.9e-05,0.11,0.38,140.01,pop
472,Green Day,American Idiot,176346,True,2004,77,0.38,0.988,1,-2.042,1,0.0639,2.6e-05,7.9e-05,0.368,0.769,186.113,rock
477,Special D.,Come With Me - Radio Edit,185133,False,2004,61,0.739,0.999,7,-5.077,1,0.0803,0.13,0.00224,0.28,0.501,139.982,pop
836,Basshunter,All I Ever Wanted - Radio Edit,176453,False,2008,65,0.645,0.984,4,-7.051,1,0.0508,0.164,0.00701,0.164,0.553,144.954,pop
1309,Bingo Players,Get Up (Rattle) - Vocal Edit,166933,False,2013,1,0.801,0.985,7,-2.69,1,0.0645,0.0205,7e-06,0.296,0.722,127.99,"pop, Dance/Electronic"


#### Exercise 3 - Data Analysis and Summarization with Pandas

Now the fun really begins! As you may know, a fundamental part of the machine learning workflow is the analysis of the data you'll be working with. You need to know your data, in order to understand how you can make the most out of it.

The pandas describe() method provides an overview of the fundamental statistics describing the distribution of each numeric variable (column) within the dataset.

In [12]:
songs_data.describe()

Unnamed: 0,duration_ms,year,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo
count,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0
mean,228748.1245,2009.494,59.8725,0.667438,0.720366,5.378,-5.512434,0.5535,0.103568,0.128955,0.015226,0.181216,0.55169,120.122558
std,39136.569008,5.85996,21.335577,0.140416,0.152745,3.615059,1.933482,0.497254,0.096159,0.173346,0.087771,0.140669,0.220864,26.967112
min,113000.0,1998.0,0.0,0.129,0.0549,0.0,-20.514,0.0,0.0232,1.9e-05,0.0,0.0215,0.0381,60.019
25%,203580.0,2004.0,56.0,0.581,0.622,2.0,-6.49025,0.0,0.0396,0.014,0.0,0.0881,0.38675,98.98575
50%,223279.5,2010.0,65.5,0.676,0.736,6.0,-5.285,1.0,0.05985,0.0557,0.0,0.124,0.5575,120.0215
75%,248133.0,2015.0,73.0,0.764,0.839,8.0,-4.16775,1.0,0.129,0.17625,6.8e-05,0.241,0.73,134.2655
max,484146.0,2020.0,89.0,0.975,0.999,11.0,-0.276,1.0,0.576,0.976,0.985,0.853,0.973,210.851


Note: As you can see, not all variables are represented in this description, such as the "artist" column. This happens because those variables are categorical, so numeric statistics such as the mean or the standard deviation do not apply to them. You'll learn how to deal with those later.

You can also study a single variable individually.

In [None]:
songs_data.year.describe()

Now, we challenge you to try the following methods, which can be applied in a similar way to the previous example.

Exercise 3.1. To see a list of unique values, we can use the unique() method. Check all the different artists represented in the dataset, applying this method.

In [None]:
unique_artists = #insert code here

unique_artists

Exercise 3.2. Sometimes, it might be useful to check how many times each value appears in a column. Check how many times each artist is represented. (To do this, you'll need the value_counts() method.)

In [None]:
artists_representivity = #insert code here

artists_representivity

Note: This method is especially useful to search for data imbalance, which happens when the distribution of a certain variable is biased towards a certain value. In machine learning tasks, it is fundamental to test this trait in the target variable.
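
As a sketch of how such an imbalance check might look (the target values below are made up), value_counts() can return relative frequencies instead of absolute counts through its `normalize` parameter:

```python
import pandas as pd

# Hypothetical target variable with an uneven class distribution
labels = pd.Series(["healthy", "healthy", "healthy", "sick"])

counts = labels.value_counts()                 # absolute frequency of each value
shares = labels.value_counts(normalize=True)   # relative frequency, summing to 1

print(counts)
print(shares)
```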

#### Exercise 4 - Indexing and Sorting

The index also deserves attention. The index of songs_data runs in order from 0 to 1999, but that might not always be the case, for example after filtering or sorting rows. To replace the index with a fresh sequence starting at 0, you can use the reset_index() method.

In [None]:
songs_data.reset_index()

As you can see, the method preserves the initial index as a new column. To get rid of this column, set "drop" to True.

In [22]:
songs_data.reset_index(drop=True)
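
The effect is easier to see on a DataFrame whose index is out of order. A minimal sketch with made-up data (names are illustrative):

```python
import pandas as pd

scores = pd.DataFrame({"score": [7, 9, 4]})             # hypothetical data
top = scores.sort_values(by="score", ascending=False)   # rows move, labels move with them

print(list(top.index))                    # the original labels, now out of order

renumbered = top.reset_index(drop=True)   # fresh 0..n-1 labels, row order unchanged

print(list(renumbered.index))
print(list(renumbered["score"]))
```

Note that reset_index() returns a new DataFrame; the original one keeps its index unless you assign the result back.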

In some cases, it might be useful to order the data according to a certain variable. Check the next example, where we ordered songs according to their associated energy.

In [24]:
songs_data.sort_values(by='energy', ascending=True)

Unnamed: 0,artist,song,duration_ms,explicit,year,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,genre
1492,Ed Sheeran,"I See Fire - From ""The Hobbit - The Desolation...",300840,False,2013,71,0.581,0.0549,10,-20.514,0,0.0397,0.559000,0.000000,0.0718,0.234,152.037,pop
496,Gary Jules,Mad World (Feat. Michael Andrews),189506,False,2001,65,0.345,0.0581,3,-17.217,1,0.0374,0.976000,0.000366,0.1030,0.304,174.117,pop
1198,Charlene Soraia,Wherever You Will Go,197577,False,2011,60,0.597,0.1150,9,-9.217,1,0.0334,0.820000,0.000215,0.1110,0.128,111.202,pop
682,Westlife,The Rose,219106,False,2006,0,0.272,0.2030,9,-9.706,1,0.0294,0.784000,0.000000,0.0805,0.172,109.581,pop
486,Katie Melua,The Closest Thing to Crazy,252466,False,2003,55,0.562,0.2190,4,-13.200,1,0.0312,0.856000,0.000296,0.0979,0.106,127.831,"pop, easy listening, jazz"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
74,Madison Avenue,Don't Call Me Baby,228140,False,1999,56,0.808,0.9820,3,-6.588,0,0.0311,0.058500,0.006890,0.3500,0.961,124.999,Dance/Electronic
836,Basshunter,All I Ever Wanted - Radio Edit,176453,False,2008,65,0.645,0.9840,4,-7.051,1,0.0508,0.164000,0.007010,0.1640,0.553,144.954,pop
1309,Bingo Players,Get Up (Rattle) - Vocal Edit,166933,False,2013,1,0.801,0.9850,7,-2.690,1,0.0645,0.020500,0.000007,0.2960,0.722,127.990,"pop, Dance/Electronic"
472,Green Day,American Idiot,176346,True,2004,77,0.380,0.9880,1,-2.042,1,0.0639,0.000026,0.000079,0.3680,0.769,186.113,rock


Try to sort the entries in songs_data in descending order of their "popularity" values.

In [26]:
popularity_ordered_data = #insert code here

popularity_ordered_data

You can also sort by index, using the following method.

In [27]:
songs_data.sort_index()

Unnamed: 0,artist,song,duration_ms,explicit,year,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,genre
0,Britney Spears,Oops!...I Did It Again,211160,False,2000,77,0.751,0.834,1,-5.444,0,0.0437,0.3000,0.000018,0.3550,0.894,95.053,pop
1,blink-182,All The Small Things,167066,False,1999,79,0.434,0.897,0,-4.918,1,0.0488,0.0103,0.000000,0.6120,0.684,148.726,"rock, pop"
2,Faith Hill,Breathe,250546,False,1999,66,0.529,0.496,7,-9.007,1,0.0290,0.1730,0.000000,0.2510,0.278,136.859,"pop, country"
3,Bon Jovi,It's My Life,224493,False,2000,78,0.551,0.913,0,-4.063,0,0.0466,0.0263,0.000013,0.3470,0.544,119.992,"rock, metal"
4,*NSYNC,Bye Bye Bye,200560,False,2000,65,0.614,0.928,8,-4.806,0,0.0516,0.0408,0.001040,0.0845,0.879,172.656,pop
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,Jonas Brothers,Sucker,181026,False,2019,79,0.842,0.734,1,-5.065,0,0.0588,0.0427,0.000000,0.1060,0.952,137.958,pop
1996,Taylor Swift,Cruel Summer,178426,False,2019,78,0.552,0.702,9,-5.707,1,0.1570,0.1170,0.000021,0.1050,0.564,169.994,pop
1997,Blanco Brown,The Git Up,200593,False,2019,69,0.847,0.678,9,-8.635,1,0.1090,0.0669,0.000000,0.2740,0.811,97.984,"hip hop, country"
1998,Sam Smith,Dancing With A Stranger (with Normani),171029,False,2019,75,0.741,0.520,8,-7.513,1,0.0656,0.4500,0.000002,0.2220,0.347,102.998,pop


Also, you can sort by more than one column at a time.

In [28]:
songs_data.sort_values(by=['popularity', 'energy'], ascending=False)

Unnamed: 0,artist,song,duration_ms,explicit,year,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,genre
1322,The Neighbourhood,Sweater Weather,240400,False,2013,89,0.612,0.807,10,-2.810,1,0.0336,0.04950,0.017700,0.1010,0.398,124.053,"rock, pop"
1311,Tom Odell,Another Love,244360,True,2013,88,0.445,0.537,4,-8.532,0,0.0400,0.69500,0.000017,0.0944,0.131,122.769,pop
201,Eminem,Without Me,290320,True,2002,87,0.908,0.669,7,-2.827,1,0.0738,0.00286,0.000000,0.2370,0.662,112.238,hip hop
1613,WILLOW,Wait a Minute!,196520,False,2015,86,0.764,0.705,3,-5.279,0,0.0278,0.03710,0.000019,0.0943,0.672,101.003,"pop, R&B, Dance/Electronic"
6,Eminem,The Real Slim Shady,284200,True,2000,86,0.949,0.661,5,-4.244,0,0.0572,0.03020,0.000000,0.0454,0.760,104.504,hip hop
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
182,Musiq Soulchild,Love,304666,False,2000,0,0.569,0.385,1,-9.919,0,0.0499,0.34200,0.000000,0.0876,0.339,99.738,"pop, R&B"
1602,Justin Bieber,Love Yourself,233720,False,2015,0,0.609,0.378,4,-9.828,1,0.4380,0.83500,0.000000,0.2800,0.515,100.418,pop
685,T-Pain,I'm N Luv (Wit a Stripper) (feat. Mike Jones),265333,True,2005,0,0.731,0.368,8,-10.380,1,0.0688,0.00544,0.000000,0.1930,0.512,145.171,"hip hop, pop, R&B"
584,T-Pain,I'm Sprung,231040,False,2005,0,0.722,0.329,0,-11.617,0,0.1080,0.08800,0.000000,0.0810,0.166,99.991,"hip hop, pop, R&B"


This method uses the second criterion when the first one is not enough to distinguish entries.
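
The direction can also be set per column by passing a list to `ascending`. The sketch below uses a made-up table (names and values are illustrative) where two rows tie on the first criterion:

```python
import pandas as pd

# Hypothetical tracks; "A" and "B" tie on popularity
tracks = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "popularity": [80, 80, 75, 90],
    "energy": [0.9, 0.4, 0.7, 0.5],
})

# Popularity from highest to lowest; ties broken by energy from lowest to highest
ranked = tracks.sort_values(by=["popularity", "energy"], ascending=[False, True])

print(ranked)
```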

#### Exercise 5 - Missing Values

Entries with missing values are given the value NaN, short for "Not a Number". For technical reasons, NaN is a floating-point value, so a numeric column containing NaN is stored with the float64 dtype. Knowing how to deal with missing values is an essential skill for any data scientist. Fortunately, pandas offers some methods to help us do that.

The data we've been working with has no missing values. However, real data is rarely that complete. Run the following cell to obtain a more realistic dataset by fabricating missing values. (This is not something you'll do often; in fact, if you have a complete dataset, all the better for you! So don't dwell on this step, go ahead!)

In [52]:
songs_missing_data = songs_data.copy()

# Replace every popularity above 70 with NaN; mask() avoids chained assignment
songs_missing_data['popularity'] = songs_missing_data['popularity'].mask(songs_missing_data['popularity'] > 70)


We'll start by selecting NaN entries in the DataFrame. The function pd.isnull() returns a boolean representation of the data reflecting whether or not each element corresponds to a missing value.

In [None]:
pd.isnull(songs_missing_data)
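
A boolean table of this size is hard to read at a glance. One common follow-up, sketched here on a made-up DataFrame (names are illustrative), is to sum the mask so that each column reports how many values it is missing:

```python
import numpy as np
import pandas as pd

# Hypothetical readings with one missing temperature
readings = pd.DataFrame({"temp": [36.6, np.nan, 38.1], "pulse": [70, 82, 75]})

missing_mask = pd.isnull(readings)        # True where a value is missing
missing_per_column = missing_mask.sum()   # True counts as 1, so this counts the NaNs

print(missing_per_column)
```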

Applying this to a single column, we can use the result to select only the rows of the DataFrame that have a known value for the chosen variable. Perform this selection for the "popularity" variable.

In [None]:
select_info = #insert code here

# Filter the DataFrame that actually contains the missing values
selected_songs = songs_missing_data[select_info]

selected_songs

As you can see, you lost a lot of entries. Most machine learning models cannot be trained on data with missing values, but simply excluding incomplete rows is rarely viable, since you lose a lot of information in the process. Instead, data scientists usually apply imputation methods. A simple imputation method offered by pandas replaces all missing values with a specific value.

In [55]:
songs_missing_data.fillna("Unknown")

Unnamed: 0,artist,song,duration_ms,explicit,year,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,genre
0,Britney Spears,Oops!...I Did It Again,211160,False,2000,Unknown,0.751,0.834,1,-5.444,0,0.0437,0.3000,0.000018,0.3550,0.894,95.053,pop
1,blink-182,All The Small Things,167066,False,1999,Unknown,0.434,0.897,0,-4.918,1,0.0488,0.0103,0.000000,0.6120,0.684,148.726,"rock, pop"
2,Faith Hill,Breathe,250546,False,1999,66.0,0.529,0.496,7,-9.007,1,0.0290,0.1730,0.000000,0.2510,0.278,136.859,"pop, country"
3,Bon Jovi,It's My Life,224493,False,2000,Unknown,0.551,0.913,0,-4.063,0,0.0466,0.0263,0.000013,0.3470,0.544,119.992,"rock, metal"
4,*NSYNC,Bye Bye Bye,200560,False,2000,65.0,0.614,0.928,8,-4.806,0,0.0516,0.0408,0.001040,0.0845,0.879,172.656,pop
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,Jonas Brothers,Sucker,181026,False,2019,Unknown,0.842,0.734,1,-5.065,0,0.0588,0.0427,0.000000,0.1060,0.952,137.958,pop
1996,Taylor Swift,Cruel Summer,178426,False,2019,Unknown,0.552,0.702,9,-5.707,1,0.1570,0.1170,0.000021,0.1050,0.564,169.994,pop
1997,Blanco Brown,The Git Up,200593,False,2019,69.0,0.847,0.678,9,-8.635,1,0.1090,0.0669,0.000000,0.2740,0.811,97.984,"hip hop, country"
1998,Sam Smith,Dancing With A Stranger (with Normani),171029,False,2019,Unknown,0.741,0.520,8,-7.513,1,0.0656,0.4500,0.000002,0.2220,0.347,102.998,pop


As you can see, all missing values were replaced by the element "Unknown". However, this is not very helpful for a numeric variable such as popularity. There are better choices of replacement value, such as statistics of that variable's distribution within the dataset. Apply the same method, replacing the missing values of the popularity variable with the mean of its known entries.

In [None]:
mean_value = songs_missing_data.popularity.mean()

imputed_data = #insert code here

imputed_data

Now do the same, but using the median instead. (Tip: there's a method to calculate the median similar to the one used above to calculate the mean.)

In [61]:
median_value = #insert code here

imputed_data = #insert code here

imputed_data


CAREFUL! In this case, we were able to apply the "fillna" method with a single value because only one variable had missing values. However, it is common for several columns to have them. In that situation, we could not use a single value for all variables, since they usually have very different distributions.
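
When several columns have missing values, fillna also accepts a dictionary that maps each column to its own replacement value. A minimal sketch on a made-up DataFrame (the column values and the choice of statistic per column are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in two different columns
df = pd.DataFrame({
    "popularity": [60.0, np.nan, 70.0],
    "tempo": [120.0, 100.0, np.nan],
})

# One replacement value per column, each computed from that column's known entries
fill_values = {
    "popularity": df["popularity"].median(),
    "tempo": df["tempo"].mean(),
}

imputed = df.fillna(value=fill_values)

print(imputed)
```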

Of course, this is still not ideal in some cases. In this dataset, we fabricated the missing values, and they all correspond to values above 70, outside the range of the remaining known data. For this reason, the mean and the median are far from the true values that are missing. There are other, more robust ways to deal with missing values; however, we will not cover them in this tutorial.

#### Exercise 6 - Combining DataFrames