# Python Frameworks for Machine Learning

In this tutorial, you will learn how to work with two of the most fundamental Python libraries for Machine Learning: NumPy and pandas.

## NumPy (https://github.com/numpy/numpy)

NumPy is a Python library for array operations. It is well suited to Machine Learning tasks, where speed and memory use matter: operations on NumPy arrays can be up to 50x faster than the same operations on Python lists.

Let's begin by installing and importing the library. (You just have to run the next cells)

In [None]:
pip install numpy

Note: Usually, you would create a Python environment and install all the required packages there. However, since this is a Jupyter notebook tutorial, we can install the package directly from a cell.

In [32]:
import numpy as np

Now, let's do some exercises that make use of NumPy's capabilities.

#### Exercise 1 - Create a NumPy array

It is quite simple to create a NumPy array. You'll see an example and then reproduce the method, using a different sequence of numbers.

In [None]:
example_array = np.array([1, 2, 3, 4, 5])
print("Here's an example: " + str(example_array) + "\n")
print("Now it's your turn.\n")

your_array = #insert code here

print(your_array)

As you can see, we used a list to create a NumPy array, but we can also use tuples.

In [None]:
array_from_tuple = np.array((1, 2, 3, 4, 5))

print(array_from_tuple)

#### Exercise 2 - Multi-dimensional arrays

So far, we've introduced one-dimensional arrays (which are basically vectors), but NumPy lets you create arrays with n dimensions. Here is an example of a two-dimensional array (a matrix):

In [None]:
# Two levels of nesting give two dimensions: this is a 3x3 matrix.
array_2d = np.array([[2, 3, 4], [5, 6, 7], [2, 3, 7]])

print(array_2d)

Now, create an array with 7 dimensions.

In [None]:
array_7d = #insert code here

print(array_7d)

You can check the shape of your array using the shape attribute (it is an attribute, not a method, so no parentheses are needed):

In [None]:
print(array_7d.shape)
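
As a minimal sketch of how dimensions and shape relate, the cell below builds a three-dimensional array and inspects it. The name `cube` and its sizes are made up for illustration. The number of dimensions is the nesting depth of the array, not the number of values in a row.

```python
import numpy as np

# Hypothetical example data: 2 blocks, each with 3 rows and 4 columns.
cube = np.zeros((2, 3, 4))

print(cube.ndim)   # number of dimensions (axes)
print(cube.shape)  # size along each axis; an attribute, so no parentheses
print(cube.size)   # total number of elements (2 * 3 * 4)
```

The same three attributes are available on any NumPy array, whatever its number of dimensions.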

#### Exercise 4 - Indexing

In NumPy, you access an element within an array using square brackets - [] - the same way you do with Python lists. Indexes start at 0. See the following example.

In [None]:
array = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

print(array[0])
print(array[1])
print(array[7])


Insert the code necessary to access and print the 9th element of the previous array.

In [None]:
ninth = #insert code here

print(ninth)

What is the length of the array? Insert the code needed to answer this question.

In [None]:
arr_len = #insert code here

print(arr_len)

Since the length of the array is 9, the 9th element of the array is the last one. You can access the last element of an array in a different manner:

In [None]:
last_element = array[-1]

print(last_element)

For multi-dimensional arrays, you need to use multiple indexes. Consider the following bi-dimensional array.

In [None]:
bidim_array = np.array([[1, 2, 3], [4, 5, 6]])

print(bidim_array)

You want to access the element with value "4" in this array, which is located in the second row, first column. How can you do that?

In [None]:
required_element = #insert code here

print(required_element)
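
The sketch below shows multi-index access on a separate, made-up 3x3 array (the name `grid` and its values are assumptions for illustration only). The first index selects the row and the second selects the column.

```python
import numpy as np

# Hypothetical 3x3 grid, used only for illustration.
grid = np.array([[10, 20, 30],
                 [40, 50, 60],
                 [70, 80, 90]])

top_right = grid[0, 2]   # row 0, column 2
bottom_row = grid[-1]    # a single index selects a whole row
middle_col = grid[:, 1]  # ':' keeps every row, 1 picks the column

print(top_right)
print(bottom_row)
print(middle_col)
```

Negative indexes and slices work on each axis independently, just as they do on one-dimensional arrays.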

#### Exercise 5 - Array Slicing

Now that you know all about indexing, let's talk slicing. You can slice sub-parts of arrays using the following notation:

In [None]:
array = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

sliced_array = array[1:6]

print(sliced_array)

Your turn. Access and print the 3 middle elements of the array ("4", "5", and "6") using a slicing technique.

In [None]:
# Named middle_slice to avoid shadowing Python's built-in slice
middle_slice = #insert code here

print(middle_slice)

You can also use negative indexes. See the example and try it yourself.

In [None]:
print(array[-6:-2])


In [None]:

your_slice = #insert code here

print(your_slice)

Now let's get a sequence of elements through steps.

In [None]:
step_slice = array[1:9:2]

print(step_slice)

Your move. Try different step and limit combinations (including negatives).

In [None]:
my_step_slice = #insert code here

print(my_step_slice)
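
For reference, here is a short sketch of the start:stop:step notation on a made-up array (`values` is a hypothetical name). The start index is included, the stop index is excluded, and a negative step walks through the array backwards.

```python
import numpy as np

values = np.arange(10)  # [0 1 2 3 4 5 6 7 8 9]

every_other = values[::2]       # start and stop omitted: whole array, step 2
reversed_all = values[::-1]     # negative step reverses the order
last_three = values[-3:]        # from the third-to-last element to the end
backwards_step = values[8:2:-2] # from index 8 down to (not including) index 2

print(every_other)
print(reversed_all)
print(last_three)
print(backwards_step)
```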

#### Exercise 6 - Data Types

We've only worked with integers so far, but ndarrays are compatible with other types of data. Try to create an array out of a list of strings.

In [None]:
str_array = #insert code here

print(str_array)

Did it work? Now try a list of floats.

In [None]:
flt_array = #insert code here

print(flt_array)

You can also convert arrays to a different data type (as long as the conversion is possible). Check the following conversion of an integer array to a boolean.

In [None]:
int_array = np.array([0, 1, 0, 0])

bool_array = int_array.astype('bool')

print(bool_array)

Now convert the float array you created into an integer array.

In [None]:
my_int_array = #insert code here

print(my_int_array)
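
One detail worth knowing about float-to-integer conversion is that `astype(int)` drops the decimal part instead of rounding. The sketch below illustrates this on made-up values (`prices` is a hypothetical name).

```python
import numpy as np

prices = np.array([1.9, 2.4, -3.7])
print(prices.dtype)  # float64

# astype(int) truncates toward zero; it does not round.
as_ints = prices.astype(int)

# If rounding is what you want, round first and convert afterwards.
rounded = np.rint(prices).astype(int)

print(as_ints)
print(rounded)
```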

#### Exercise 7 - Join and Split Arrays

If we wish to join the contents of two or more arrays into a single array, we can use the np.concatenate function.

In [None]:
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

arr = np.concatenate((arr1, arr2))
print(arr)

Join the contents of the following 4 arrays.

In [None]:
arr1 = np.array([5, 23, 4, 99])
arr2 = np.array([98, 3, 15])
arr3 = np.array([70, 20, 12])
arr4 = np.array([1])

final_arr = # insert code here
print(final_arr)

Now let's split the array you created.

In [None]:
split_results = np.array_split(final_arr, 2)

print(split_results)

Your turn. Split final_arr into 5 arrays.

In [None]:
split_arr = # insert code here

print(split_arr)
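
When the array length is not a multiple of the number of parts, `np.array_split` still works and returns parts of unequal size, whereas `np.split` raises an error. A minimal sketch on a made-up array (`numbers` is a hypothetical name):

```python
import numpy as np

numbers = np.arange(7)  # [0 1 2 3 4 5 6]

# 7 elements cannot be divided evenly by 3, so the first part gets the extra one.
parts = np.array_split(numbers, 3)
for part in parts:
    print(part)

# np.split insists on an equal division and raises ValueError otherwise.
try:
    np.split(numbers, 3)
except ValueError as error:
    print("np.split refused:", error)
```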

#### Exercise 8 - Search Arrays

In data analysis tasks, it may be useful to search for the elements of an array that satisfy a certain condition. For this purpose, NumPy offers the function np.where, which does exactly that. See the next example.

In [None]:
arr = np.array([4, 93, 83, 94, 12])

print(np.where(arr==12))

As you can see, np.where located the value "12" at index 4, i.e. the fifth position of the array (indexes start at 0).
You can use any type of condition for your search. For certain tasks, it might be useful to locate all instances that satisfy a certain condition. For instance, given an array containing the ages of all the patients in a clinical facility, we might want to identify only the adults (aged 18 or over). How would you do that?

In [None]:
ages = np.array([8, 19, 18, 29, 88, 82, 3, 45, 51, 54, 74, 54, 66, 23, 3, 7, 92, 65, 64])

adults_idx = #insert code here

print(adults_idx)
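
To clarify what `np.where` returns, here is a small sketch on made-up body temperatures (`temperatures` and the 37.5 threshold are assumptions for illustration). With one argument it returns a tuple containing one index array per dimension. With three arguments it picks between two values element by element.

```python
import numpy as np

temperatures = np.array([36.5, 38.2, 37.0, 39.1])

# One-argument form: a tuple with one index array per dimension.
result = np.where(temperatures > 37.5)
fever_idx = result[0]

# Three-argument form: choose a value for each element.
labels = np.where(temperatures > 37.5, "fever", "normal")

print(fever_idx)
print(labels)
```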

Now keep in mind the previous example. What if you wanted to sort the patients according to their age? NumPy also has a method for that: np.sort. How do you think you can implement it? Complete the following cell adequately to obtain an ordered ages array.

In [None]:
ordered_ages = #insert code here

print(ordered_ages)

Sometimes, we do not want an ordered representation, but the ordered indexes instead, so that we can access them in the original array. For that, we use np.argsort. Give it a try!

In [None]:
sorted_indexes = np.argsort(ages)

print(sorted_indexes)
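
The sorted indexes are most useful when you need to reorder a second array in the same way. The sketch below uses two made-up arrays (`heights` and `names` are hypothetical) to show this.

```python
import numpy as np

heights = np.array([172, 158, 190, 165])
names = np.array(["Ana", "Ben", "Cleo", "Dan"])

# Indexes that would sort heights in ascending order.
order = np.argsort(heights)

# Using them as an index reorders any array of the same length.
names_by_height = names[order]

print(order)
print(names_by_height)
print(heights[order])  # same result as np.sort(heights)
```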

What do you think happens when you apply np.sort to a string array? Check your guess in the next cell.

In [None]:
string_arr = np.array(["what", "is", "going", "on", "here"])

sorted_str_arr = #insert code here

print(sorted_str_arr)

#### Exercise 9 - Filtering

NumPy is also a powerful tool to filter information. Check the next example.

In [None]:
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# We want only the odd numbers in the array
odd_numbers = []
for number in arr:
    if number % 2 != 0:
        odd_numbers.append(True)
    else:
        odd_numbers.append(False)

filtered_array = arr[odd_numbers]

print(filtered_array)
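
The loop above builds the boolean mask one element at a time. As an alternative sketch, NumPy can build the same mask with a single vectorized comparison, and masks can be combined with `&` (and) or `|` (or). The variable names below are chosen for illustration.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# The comparison is applied to every element at once and returns booleans.
is_odd = arr % 2 != 0
odd_numbers = arr[is_odd]

# Each condition needs its own parentheses when combining masks.
odd_and_large = arr[(arr % 2 != 0) & (arr > 5)]

print(odd_numbers)
print(odd_and_large)
```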

You can also use the indexes of the elements to filter arrays. Consider Exercise 8, where you had to search for the adult patients within an array of patients' ages. How can you obtain an array composed exclusively of the ages of adult patients using filtering techniques?

In [None]:
adults_idxs = #insert code here

filtered_ages = #insert code here

print(filtered_ages)

You concluded your introductory class on NumPy! Congratulations!

## Pandas (https://pandas.pydata.org/)

Pandas is one of the most widely used Python libraries for data analysis. In this tutorial, you'll learn the core skills needed to work with pandas in the context of Machine Learning.

Start by installing and importing the library.

In [None]:
pip install pandas

In [2]:
import pandas as pd

Let's start by understanding the two major data representations in pandas: DataFrames and Series.

### Series

A pandas Series is a one-dimensional sequence of values, much like a Python list or a NumPy array, with an index that labels each value. Check the next cell:

In [None]:
sample_series = pd.Series([1, 2, 3, 4, 5])

print(sample_series)

A Series is basically a DataFrame column. Let's look at DataFrames next!

### DataFrame

A DataFrame is a table containing an array (matrix) of individual entries. Each entry has a row and a column associated. Let's see an example with made-up clinical data.

In [None]:
clinical_df = pd.DataFrame({'Age': np.array([18, 61, 34]), 'Diseases': ['None', 'Diabetes', 'Hypertension'], 'Smoking History': ['Non-smoker', 'Past smoker', 'Current smoker']})

print(clinical_df)

As you can see, we can represent and organize different types of data with DataFrames. Create a DataFrame with made-up movie-related data (be creative!).

In [None]:
movie_df = #insert your code here

print(movie_df)
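
To connect the two representations described above, the sketch below uses a tiny made-up table (`pets` is a hypothetical name) to show that selecting one column of a DataFrame returns a Series, while selecting a list of columns returns a DataFrame.

```python
import pandas as pd

pets = pd.DataFrame({"name": ["Rex", "Tom"], "age": [3, 5]})

# Single brackets with one column name return a Series.
age_column = pets["age"]
print(type(age_column))

# Double brackets (a list of names) return a DataFrame, even with one column.
one_column_table = pets[["age"]]
print(type(one_column_table))
```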

#### Exercise 1 - Import Data

Although I'm sure you'll agree it's pretty easy to create simple DataFrames manually, data scientists do not usually work with small data quantities such as the one you just created. Instead, they need to import data from big and heavy data files. Pandas is compatible with the most popular data formats such as ".csv", ".txt", ".json", ".sql", among others.

Let's begin by importing tabular data from a ".csv" file. The data covers the top hits on Spotify from 2000 to 2019 (https://www.kaggle.com/datasets/paradisejoy/top-hits-spotify-from-20002019?select=songs_normalize.csv). To load it, run the following cell.

In [3]:
songs_data = pd.read_csv("songs_normalize.csv")

You successfully loaded the dataset from a ".csv" file. Check its size with the attribute "shape".

In [None]:
print("DataFrame shape:")
print(songs_data.shape)

print("This DataFrame has " + str(songs_data.shape[0]) + " rows and " + str(songs_data.shape[1]) + " columns.")

You can check the contents of the DataFrame by looking at its first few rows. You do that using the method "head()", which returns the first five rows by default.

In [None]:
print(songs_data.head())

#### Exercise 2 - Indexing

In order to work with data, you need to know how to access data. Pandas offers you a fluid way to do that through DataFrames.

You can access a column in many ways:

In [None]:
print("Method 1:")
songs_data.song


In [None]:
print("Method 2:")

songs_data['song']

In [None]:
print("Method 3:")

songs_data.loc[:,'song']

Although they all work, you should take special interest in the third method, as it is more flexible than the other ones in terms of data accessibility.

Access the "popularity" column using the third method.

In [None]:
popularity_col = #insert code here

print(popularity_col)

.loc is a label-based selection method. It has a corresponding position-based selection method: .iloc. Try it! Select and print the second column (the "song" column) of the songs DataFrame using .iloc.

In [None]:
second_col = #insert code here

print(second_col)

These methods allow you to select both rows and columns at the same time, and you can even select ranges of both (similar to what you did when slicing NumPy arrays). Check this example:

In [None]:
example = songs_data.iloc[1:3, 0:4]

print(example)

The first argument refers to the rows, while the second indicates the columns. Now it's your turn: select the contents of the last three columns for the songs in rows 6 to 10.

In [None]:
selection = #insert your code here
print(selection)
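
One difference between the two selectors is easy to miss: a `.loc` slice includes its end label, while an `.iloc` slice excludes its end position, like Python and NumPy slices. The sketch below shows this on a made-up table whose index labels (10, 20, 30, 40) are deliberately different from the row positions. All names and values are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"title": ["A", "B", "C", "D"], "year": [2001, 2005, 2010, 2019]},
    index=[10, 20, 30, 40],
)

# Label-based: rows labelled 10 through 30, end label included (3 rows).
by_label = df.loc[10:30, "title"]

# Position-based: rows at positions 0 and 1, end position excluded (2 rows).
by_position = df.iloc[0:2, 0]

print(by_label)
print(by_position)
```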

The .loc method also allows you to perform conditional selection.

In [None]:
songs_data.loc[songs_data.artist == 'Britney Spears']

Try it yourself. Select all rows corresponding to songs with energy above 0.98.

In [None]:
selected_data = #insert code here

selected_data
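
Conditions can also be combined. The sketch below uses a small made-up table (`tracks` and its values are hypothetical) to show `&` for "and", `isin` for membership tests, and a second `.loc` argument to keep only some columns.

```python
import pandas as pd

tracks = pd.DataFrame({
    "artist": ["X", "Y", "X", "Z"],
    "energy": [0.91, 0.45, 0.30, 0.99],
    "year": [2003, 2010, 2015, 2018],
})

# Each condition needs its own parentheses when combined with & or |.
energetic_recent = tracks.loc[(tracks["energy"] > 0.9) & (tracks["year"] >= 2010)]

# isin() tests membership; the second argument restricts the columns returned.
x_or_z = tracks.loc[tracks["artist"].isin(["X", "Z"]), ["artist", "year"]]

print(energetic_recent)
print(x_or_z)
```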

#### Exercise 3 - Data Analysis and Summarization with Pandas

Now the fun really begins! As you may know, a fundamental part of the machine learning workflow is the analysis of the data you'll be working with. You need to know your data, in order to understand how you can make the most out of it.

The pandas describe() method provides an overview of the main summary statistics (count, mean, standard deviation, minimum, quartiles and maximum) for each numeric variable (column) in the dataset.

In [None]:
songs_data.describe()

Note: As you can see, not all variables are represented in this description, such as the "artist" column. This happens because those variables are categorical, so numeric summary statistics such as the mean do not apply to them. You'll learn how to deal with those later.

You can also study a single variable individually.

In [None]:
songs_data.year.describe()

Now, we challenge you to try the following methods, which are applied in a similar way to the previous example.

Exercise 3.1. To see a list of unique values, we can use the unique() method. Check all the different artists represented in the dataset, applying this method.

In [None]:
unique_artists = #insert code here

unique_artists

Exercise 3.2. Sometimes, it might be useful to check how many times each value appears in a column. Check how many times each artist is represented. (To do this, you'll need the value_counts() method.)

In [None]:
artists_representivity = #insert code here

artists_representivity

Note: This method is especially useful to search for data imbalance, which happens when the distribution of a certain variable is biased towards a certain value. In machine learning tasks, it is fundamental to test this trait in the target variable.
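
As a minimal sketch of checking for imbalance, the cell below counts a made-up target variable (`labels` and its values are hypothetical) and also expresses the counts as proportions using `normalize=True`.

```python
import pandas as pd

# Hypothetical target variable: 9 "healthy" entries and 1 "sick" entry.
labels = pd.Series(["healthy"] * 9 + ["sick"])

counts = labels.value_counts()                # absolute counts per value
shares = labels.value_counts(normalize=True)  # fraction of the total per value

print(counts)
print(shares)
```

A target where one value holds most of the share, as in this made-up case, is what the note above calls imbalanced.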

#### Exercise 4 - Indexing and Sorting

Indexing is also a relevant topic. The index of songs_data is already sequential (0, 1, 2, ...), but that might not always be the case, for example after filtering or sorting. To get a fresh sequential index, you can use the reset_index() method.

In [None]:
songs_data.reset_index()

As you can see, the method preserves the initial indexes as a new column. To get rid of this column, set "drop" to True.

In [22]:
songs_data.reset_index(drop=True)
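
The sketch below shows a typical situation where resetting the index helps: after filtering, the surviving rows keep their original labels, which leaves gaps. The table `scores` is made up for illustration. Note that reset_index returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

scores = pd.DataFrame({"score": [55, 90, 72, 88]})

# Filtering keeps the original index labels (1 and 3 here).
high = scores[scores["score"] > 80]
print(high.index.tolist())

# reset_index returns a renumbered copy; 'high' itself is not modified.
renumbered = high.reset_index(drop=True)
print(renumbered.index.tolist())
```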

In some cases, it might be useful to order the data according to a certain variable. Check the next example, where we ordered songs according to their associated energy.

In [None]:
songs_data.sort_values(by='energy', ascending=True)

Try to sort the entries in songs_data in descending order of their "popularity" values.

In [26]:
popularity_ordered_data = #insert code here

popularity_ordered_data

You can also sort by index, using the following method.

In [None]:
songs_data.sort_index()

You can also sort by more than one column at a time.

In [None]:
songs_data.sort_values(by=['popularity', 'energy'], ascending=False)

This method uses the second criterion when the first one is not enough to distinguish entries.
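
The sort direction can also be set per column by passing a list to `ascending`. A minimal sketch on a made-up table (`songs` and its values are hypothetical), sorting by popularity from highest to lowest and breaking ties by energy from lowest to highest:

```python
import pandas as pd

songs = pd.DataFrame({
    "song": ["a", "b", "c", "d"],
    "popularity": [80, 80, 65, 90],
    "energy": [0.7, 0.4, 0.9, 0.5],
})

# One True/False per column in 'by': popularity descending, energy ascending.
ranked = songs.sort_values(by=["popularity", "energy"], ascending=[False, True])

print(ranked)
```

Songs "a" and "b" share the same popularity, so their relative order is decided by energy alone.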

#### Exercise 5 - Missing Values

Entries with missing values are given the value NaN, short for "Not a Number". For technical reasons, these NaN values are always of the float64 dtype. Knowing how to deal with missing values is an essential skill for any data scientist. Fortunately, pandas offers some methods to help us do that.

The data we've been working with has no missing values. However, real data is rarely this complete. Run the following cell to obtain a more realistic dataset, by fabricating missing inputs. (This is not something you'll do often. In fact, if you have complete datasets, that's all the better for you! So don't waste your time here, go ahead!)

In [None]:
songs_missing_data = songs_data.copy()

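# Replace every popularity value above 70 with NaN.
# mask() avoids chained assignment, which may silently fail to modify the DataFrame.
songs_missing_data['popularity'] = songs_missing_data['popularity'].mask(songs_missing_data['popularity'] > 70)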

We'll start by selecting NaN entries in the DataFrame. The function pd.isnull() returns a boolean representation of the data reflecting whether or not each element corresponds to a missing input.

In [None]:
pd.isnull(songs_missing_data)

Applying this to a single column, we can use the result to select only the entries of the DataFrame that have a known value for the chosen variable. Perform this selection for the "popularity" variable.

In [None]:
select_info = #insert code here

# Select from the DataFrame that actually contains the missing values
selected_songs = songs_missing_data[select_info]

selected_songs

As you can see, you lost a lot of entries. Many machine learning models cannot be trained on data with missing values, but excluding entries is often not viable, since you lose a lot of information in the process. Instead, data scientists usually apply imputation methods. A simple imputation method offered by pandas replaces every missing value with a specific value.

In [None]:
songs_missing_data.fillna("Unknown")

As you can see, all missing values were replaced by the element "Unknown". However, this is not very helpful for a numeric variable such as popularity. There are better choices of replacement value, such as statistics of that variable's distribution within the dataset. Apply the method presented, replacing the missing inputs with the mean value of the known entries, for the popularity variable.

In [None]:
mean_value = songs_missing_data.popularity.mean()

imputed_data = #insert code here

imputed_data

Now do the same, but using the median instead. (Tip: there's a method to calculate the median, similar to the one used above to calculate the mean.)

In [61]:
median_value = #insert code here

imputed_data = #insert code here

imputed_data


CAREFUL! In this case, we were able to apply the "fillna" method with a single value because only one variable had missing inputs. However, it is common for several columns to have missing values. In that situation, we could not use a single value for all variables, since they usually have very different distributions.
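
One way to handle several columns, sketched below on a made-up table (`patients` and its values are hypothetical), is to pass `fillna` a dictionary that maps each column name to its own replacement value.

```python
import numpy as np
import pandas as pd

patients = pd.DataFrame({
    "age": [30.0, np.nan, 50.0],
    "city": ["Porto", None, "Lisbon"],
})

# One replacement value per column: the median for age, a placeholder for city.
fill_values = {"age": patients["age"].median(), "city": "Unknown"}
imputed = patients.fillna(fill_values)

print(imputed)
```

Columns that are not listed in the dictionary are left unchanged.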

Of course, this is still not ideal in some cases. In this dataset, we fabricated the missing values, and they all correspond to values outside the known data distribution, which is now capped at 70. For this reason, the imputed values are far from the true values that are missing. There are other, more robust ways to deal with missing inputs; however, we will not cover them in this tutorial.

#### Exercise 6 - Combining DataFrames

Sometimes, you might have more than one data source. Since you want to get as much useful data as possible to train your models, you need to consider all of it. Run the following cell, which will load a different dataset from the one we've been working with.

In [78]:
subway_locations = pd.read_csv("subway_locations_in_us.csv")

Now check its content (use the head() method).

In [None]:
df_head = #insert code here

df_head

Check its shape.

In [None]:
df_shape = #insert code here

df_shape

Let's pretend that these two datasets are related (as if each song was played in a specific subway location). However, we have only 2000 songs and 22645 subway locations represented. Let's pretend that only the first 2000 rows of the subway_locations DataFrame are related to our problem. Run the following cell. (This is purely hypothetical, just for the sake of this exercise.)

In [73]:
subway_locations = subway_locations.loc[0:1999, :]

subway_locations.shape

Now we have compatible data. But it's still separated. We can solve that using the join method.

In [None]:
complete_df = songs_data.join(subway_locations)

complete_df.columns

As you can see, we now have variables from both datasets. We have, however, an unwanted column "Unnamed: 0", which stores the indexes of the second DataFrame. We can get rid of it through the "drop" method.

In [None]:
complete_df = complete_df.drop("Unnamed: 0", axis='columns')

complete_df.columns

Et voilà! A new DataFrame composed of the content of both data sources.
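
It helps to know that `join` matches rows by their index labels, not by their order. The sketch below, on two made-up tables (`left` and `right` are hypothetical), shows that rows without a match in the second DataFrame are kept and filled with NaN.

```python
import pandas as pd

left = pd.DataFrame({"song": ["a", "b", "c"]}, index=[0, 1, 2])
right = pd.DataFrame({"city": ["Boston", "Austin"]}, index=[0, 2])

# By default join keeps every row of 'left' and matches on index labels.
joined = left.join(right)

print(joined)
```

This is why the two DataFrames in the exercise were first cut to the same 2000 index labels before joining.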

You may find yourself in a situation where you want to concatenate information vertically, instead of horizontally, i.e., you might want to add new entries (rows) to your dataset. To do that, you should use the function pd.concat, which receives a list of DataFrames as input and returns them stacked into a single DataFrame. Let's concatenate songs_data with a copy of itself.

In [None]:
double_songs = pd.concat([songs_data, songs_data.copy()])

# Use print here: a notebook cell only displays its last expression
print(double_songs.shape)

double_songs.head()

As you can see, it has doubled the number of entries, as we expected. Since the columns from both DataFrames are the same, there are no conflicts.
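
One side effect of stacking is that the original index labels are kept, so labels repeat in the result. A minimal sketch on two made-up tables (`first` and `second` are hypothetical) shows the effect and how `ignore_index=True` renumbers the rows.

```python
import pandas as pd

first = pd.DataFrame({"song": ["a", "b"]})
second = pd.DataFrame({"song": ["c", "d"]})

# Default behaviour: each DataFrame keeps its own labels, so 0 and 1 repeat.
stacked = pd.concat([first, second])
print(stacked.index.tolist())

# ignore_index=True discards the old labels and numbers the rows from 0.
renumbered = pd.concat([first, second], ignore_index=True)
print(renumbered.index.tolist())
```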

Pandas has a lot more methods to offer. You can explore its content here: https://pandas.pydata.org/