# Python Crash Course - Session 2

A Python package is a way to organize and structure code so it can be easily reused and shared. Packages save us from having to “reinvent the wheel,” since many of them already provide solutions to common problems. In this crash course, we’ll focus on the following packages:

- [NumPy](https://numpy.org/)
- [Pandas](https://pandas.pydata.org/)

### How to install and import packages

To use a package, we must follow two separate steps:
- Install the package in our environment
- Import the package into our code

We only need to install a package once, using a **package manager** such as `pip`, the most widely used one for Python. To do this, we run the `pip install` command in a terminal. Notebook platforms such as Jupyter Notebook also let us run terminal commands directly from a code cell: putting a `!` before the command tells the notebook to run it as a terminal command instead of Python code.


In [None]:
!pip install pandas

When we do this, the package manager checks whether the package is already installed and whether its requirements are satisfied. For example, on Python 3.12, version `2.2.2` of `pandas` requires at least version `1.26.0` of `numpy`. All of this information is available on the Python Package Index (PyPI), the repository that `pip` queries for packages and their requirements.
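
To check which versions ended up in the environment, one option is the standard library's `importlib.metadata` module. This is a minimal sketch that assumes `numpy` and `pandas` are already installed; the variable name `package_name` is only illustrative.

```python
from importlib.metadata import version

# Look up the installed version of each package by its PyPI name
for package_name in ("numpy", "pandas"):
    print(package_name, version(package_name))
```

The printed version numbers depend on the environment, so they will differ from one machine to another.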

#### Exercise 1
Let's try to install the `ydata-profiling` package in our environment.

In [None]:
# Write code here

To make things even easier, Colab already has most of the main packages used for data analysis installed in your environment. Therefore, the only thing you need to do is import them.

To do so, simply use the `import` keyword.

In [None]:
import numpy

With our package now imported, we are ready to start using it! To call any function from a package, we use the package name followed by a `.` and the function name. For example, we can use the `abs` function to find the absolute value of a number:

In [None]:
print(numpy.abs(-123))

#### Exercise 2

Create a variable with the values `[1, 2, 3, 4, 5]`, use NumPy to compute the mean of this list, and print it.

**Hint:** The name of the function is `mean`.

In [None]:
# Write code here

Sometimes the names of packages are long or complicated. Therefore, it is common practice to create an alias for them. To do this, we add the `as` keyword after the package name in the `import` statement and specify the alias. For example, `NumPy` is commonly imported as `np`.

In [None]:
import numpy as np

print(np.abs(-123))

## NumPy

### NumPy Array

Let's get started with NumPy, which is the fundamental package for scientific computing in Python. At the core of the NumPy package is the `ndarray`, or `NumPy Array`. 

The main advantage of a NumPy array is the ability to handle more than one dimension, which in NumPy are called axes, and use various operations across those dimensions.

-----------------------------------

NumPy array with 1 dimension containing 3 elements:

[1, 2, 3]

-----------------------------------

NumPy array with 2 dimensions, one with length 2 (the number of one-dimensional arrays) and the other dimension with length 3 (the number of elements inside those arrays).

[[1, 0, 0],

 [0, 1, 2]]

-----------------------------------

There are various ways to create a NumPy Array. The simplest is to call the function `array` and pass a list as its argument.

In [None]:
my_numpy_array = np.array([1, 2, 3])
print(my_numpy_array)

In [None]:
my_numpy_array = np.array([[1, 2], [3, 4]])
print(my_numpy_array)

### Dimensions and Shapes

To have a better grasp of the characteristics of these arrays we have 3 important attributes:
- `shape` = size of the array in each dimension
- `ndim` = number of dimensions
- `size` = total number of elements (all dimensions multiplied)

In [None]:
my_numpy_array = np.array([[1, 2], [3, 4]])
print(my_numpy_array)
print("-------------")
print(my_numpy_array.shape)
print(my_numpy_array.ndim)
print(my_numpy_array.size)

#### Exercise 3

Create NumPy arrays with the following shapes:
- (6,) -> 1 list containing 6 elements
- (2, 3) -> 1 list containing 2 lists containing 3 elements each
- (3, 2) -> 1 list containing 3 lists containing 2 elements each
- (2, 2, 3) -> 1 list containing 2 lists containing 2 lists containing 3 elements each

In [None]:
# Write code here

### Other ways to create an Array

We can also create an array of a chosen shape filled with either 0 or 1, using the functions `zeros` and `ones`, respectively.

In [None]:
zeros = np.zeros((3, 4))
ones = np.ones((4, 3))

print(zeros)
print("-----------------")
print(ones)

We can also create a random array filled with floats between 0 and 1 using the `random.rand` function. For integers, we can use the `random.randint` function, but this time we need to specify the minimum value (inclusive) and the maximum value (exclusive).

In [None]:
print(np.random.rand(2, 3))
print("-----------------")
print(np.random.randint(0, 100, (2, 3)))
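
The bounds of these two functions are easy to get wrong, so here is a small sketch that checks them. The names `random_floats` and `dice_rolls` are illustrative only, and the printed values change on every run because they are random.

```python
import numpy as np

# Floats in the half-open interval [0, 1): 0 can appear, 1 cannot
random_floats = np.random.rand(2, 3)

# Integers from 1 to 6: the upper bound (7) is never returned
dice_rolls = np.random.randint(1, 7, size=10)

print(random_floats)
print(dice_rolls)

# Both checks print True, whatever the random values are
print(dice_rolls.min() >= 1, dice_rolls.max() <= 6)
```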

Finally, we can also create an array using the `arange` function, which takes a start value, a stop value, and a step, and generates the corresponding sequence of values.

In [None]:
# Start at 10 and stop at 30 (exclusive) with a step of 5
print(np.arange(10, 30, 5))
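
A few more calls may help to see how the three arguments of `arange` interact. The sketch below uses made-up variable names and only documented behaviour: the start defaults to 0, the step defaults to 1, and the stop value is never included.

```python
import numpy as np

# With a single argument, arange starts at 0 and uses a step of 1
first_five = np.arange(5)
print(first_five)      # [0 1 2 3 4]

# The stop value itself is never included
tens = np.arange(10, 30, 10)
print(tens)            # [10 20]

# A negative step counts downwards
countdown = np.arange(5, 0, -1)
print(countdown)       # [5 4 3 2 1]
```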

#### Exercise 4
Create an array with all the even numbers between 2 and 100.

In [None]:
# Write code here

### Reshape and Flatten

We can change the shape of an array by using the `reshape` function. The new shape must hold the same total number of elements as the original array.

In [None]:
print(np.arange(10, 30, 5))
print("-----------------")
print(np.arange(10, 30, 5).reshape(2, 2))
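
Reshaping only works when the new shape holds exactly the same number of elements. The following sketch, with an illustrative array called `values`, shows two valid shapes and one that NumPy rejects with a `ValueError`.

```python
import numpy as np

values = np.arange(6)          # 6 elements: [0 1 2 3 4 5]

print(values.reshape(2, 3))    # works: 2 * 3 == 6
print(values.reshape(3, 2))    # works: 3 * 2 == 6

try:
    values.reshape(4, 2)       # fails: 4 * 2 != 6
except ValueError as error:
    print(f"reshape failed: {error}")
```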

On the other hand, we can also use the `flatten` function to convert any array into a one-dimensional array.

In [None]:
my_numpy_array = np.array([[1, 2], [3, 4], [5, 6]])

print(my_numpy_array)
print("-----------------")
print(my_numpy_array.flatten())

#### Exercise 5
Create a 2-dimensional array with all the even numbers between 2 and 100, arranged in N rows of size 5.

**Hint:** N is simply the size of the original array divided by the size of each row (5).

In [None]:
# Write code here

In [None]:
# Another simple way is to use -1 instead of doing the calculations
print(np.arange(2, 101, 2).reshape(-1, 5))

### Operations

Now that we know how to create a NumPy array, we can start using the operations that NumPy provides. The ones that involve only a single array include:
- `mean` = returns the mean of all the elements
- `sum` = returns the sum of all the elements
- `max` = returns the maximum value
- `min` = returns the minimum value
- `std` = returns the standard deviation

In [None]:
my_numpy_array = np.random.randint(-10, 101, (5,5))
print(my_numpy_array)
print("-----------------")
print(f"mean: {my_numpy_array.mean()}")
print(f"sum: {my_numpy_array.sum()}")
print(f"max: {my_numpy_array.max()}")
print(f"min: {my_numpy_array.min()}")
print(f"std: {my_numpy_array.std()}")

We can also use the `axis` parameter to apply operations along specific dimensions of the array:
- `axis=0` -> perform the operation down the rows (along the first dimension), giving one result per column
- `axis=1` -> perform the operation across the columns (along the second dimension), giving one result per row
- ...

In [None]:
print(f"mean of cols: {my_numpy_array.mean(axis=0)}")
print(f"mean of rows: {my_numpy_array.mean(axis=1)}")
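
Because the array above is random, it can be hard to verify the results by eye. A small fixed array (here called `grid`, an illustrative name) makes the effect of `axis` easier to check by hand.

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

# axis=0 collapses the rows: one result per column
column_sums = grid.sum(axis=0)
print(column_sums)   # [5 7 9]

# axis=1 collapses the columns: one result per row
row_sums = grid.sum(axis=1)
print(row_sums)      # [ 6 15]

# No axis: a single result for the whole array
print(grid.sum())    # 21
```

The shape of each result follows the same rule: the axis that was passed disappears, so a `(2, 3)` array gives 3 values for `axis=0` and 2 values for `axis=1`.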

#### Exercise 6

Create an array with the values from 0 to 999 with 2 dimensions, and with each row containing exactly 50 numbers.

Calculate the sum of each column.

In [None]:
# Write code here

Finally, it is also possible to do element-wise operations.

In [None]:
my_numpy_array = np.arange(1, 10)

print(my_numpy_array)
print(my_numpy_array + 5)
print(my_numpy_array - 5)
print(my_numpy_array * 5)
print(my_numpy_array / 5)
print(my_numpy_array ** 2)
print(my_numpy_array * my_numpy_array)
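
To see why this matters, it can help to compare a NumPy array with a plain Python list, where the same operator behaves differently. The names `plain_list` and `numpy_array` below are illustrative.

```python
import numpy as np

plain_list = [1, 2, 3]
numpy_array = np.array(plain_list)

# On a list, * repeats the list
print(plain_list * 2)     # [1, 2, 3, 1, 2, 3]

# On a NumPy array, * multiplies each element
doubled = numpy_array * 2
print(doubled)            # [2 4 6]

# Two arrays of the same shape are combined position by position
combined = numpy_array + np.array([10, 20, 30])
print(combined)           # [11 22 33]
```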

#### Exercise 7

Create a NumPy Array with shape (3,3) that contains the values 1 through 9. 

Multiply all elements by the sum of the elements in the second column, and print the array.

Then add 7 to every element and print the array.

In [None]:
# Write code here

### NaN

NumPy also provides a simple way to represent missing, undefined, or invalid numeric data by classifying those values as NaN (Not a Number).

In [None]:
print(np.nan)

NumPy has NaN-aware versions of many functions that ignore NaN values when computing the result. For instance, we can use `nansum` to compute a sum that ignores the NaN values.

In [None]:
my_numpy_array = np.array([1, 2, np.nan])
print(np.sum(my_numpy_array))
print(np.nansum(my_numpy_array))
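
A short sketch of a few other NaN-related tools, using an illustrative array called `readings`. It relies only on `np.isnan`, `np.nanmean`, and `np.nanmax`, which follow the same naming pattern as `nansum`.

```python
import numpy as np

readings = np.array([1.0, np.nan, 3.0])

# NaN is never equal to anything, not even itself
print(np.nan == np.nan)        # False

# isnan returns a boolean array marking the NaN positions
nan_mask = np.isnan(readings)
print(nan_mask)                # [False  True False]

# NaN-aware versions of other operations ignore the missing value
print(np.nanmean(readings))    # 2.0
print(np.nanmax(readings))     # 3.0
```

Since `==` cannot detect NaN, `np.isnan` is the usual way to locate missing values in an array.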

## Pandas

[Pandas](http://pandas.pydata.org) is a very popular Python package that provides data structures and data analysis tools.  It includes tools for reading and writing various data formats, processing data sets in an efficient DataFrame object, and the ability to reshape, filter, index, and subset data easily.

First, we need to import the Pandas package.  A very common convention is to import Pandas using the alias `pd`.

In [None]:
import pandas as pd

Similarly to NumPy, Pandas also introduces some new objects:
- `Series` -> one-dimensional labeled array (built on top of a NumPy array)
- `DataFrame` -> two-dimensional data structure that holds data (built on top of Series)

### Series

Creating a Series in Pandas is very simple. We just need to create a new `Series` object and pass a list as a parameter.

In [None]:
my_series = pd.Series([1, 3, 5, np.nan, 6, 8])
print(my_series)

As you can see, a Series automatically creates an `index` (or `label`) for each observation. An interesting feature of a Series is that the index does not necessarily have to be an integer.

In [None]:
my_series = pd.Series([1, 3, 5, np.nan, 6, 8], index=("a", "b", "c", "d", "e", "f"))
print(my_series)

We can also create a Series from a dictionary. In this case, the dictionary keys are used as the index, and the values become the corresponding elements of the Series.

In [None]:
fruits = {"apples": 3, "bananas": 5, "oranges": 2}

my_series = pd.Series(fruits)
print(my_series)

A Series allows you to quickly find a value by its position, like an array, while also being able to retrieve a value by its key/index, like a dictionary.
- To search by position, we use the `iloc` property
- To search by index, we can use `loc` or simply the index itself, as by default the Series will select based on the index.

In [None]:
my_series = pd.Series([1, 3, 5, np.nan, 6, 8], index=("a", "b", "c", "d", "e", "f"))

print(my_series)
print("-----------------")
print(my_series["b"]) # By index
print(my_series.loc["f"]) # By index
print(my_series.iloc[2]) # By position
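
The difference between `loc` and `iloc` is easiest to see when the index labels are integers that do not match the positions. The Series `scores` below is a made-up example for that purpose.

```python
import pandas as pd

# The index labels are integers, but deliberately not in positional order
scores = pd.Series([10, 20, 30], index=[2, 1, 0])

by_label = scores.loc[0]
by_position = scores.iloc[0]

print(by_label)      # 30 -> the value whose label is 0
print(by_position)   # 10 -> the value in the first position
```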

#### Exercise 8
In a Series, we can also apply NumPy operations. For example, let’s create a Series with numbers from 0 to 10 and print their sum.

In [None]:
# Write code here

### DataFrame

A `DataFrame` is a data structure in Pandas that uses Series to generate a table-like object that contains named columns of data. There are 4 main ways to create a dataframe:
- Using a dictionary of lists
- Using a NumPy array
- Using Series
- Reading from an external source (CSV, Excel, HTML, JSON, etc.)

In [None]:
# Using a dictionary of lists

data = {
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "city": ["New York", "Paris", "London"]
}

df = pd.DataFrame(data)
df

In [None]:
# Using a NumPy Array

data = np.array([[1, 2, 3], 
                [4, 5, 6]])

df = pd.DataFrame(data, columns=["A", "B", "C"])
df

In [None]:
# Using Series

data = pd.Series(range(5))

df = pd.DataFrame(data, columns=["Numbers"])
df

Probably the most important way to create a `DataFrame` is by reading from an external data source. Although every type of data source has its own specific function, they all follow a similar naming pattern: `read_<source>`.

For instance, we will be using the CSV file from [SAS Software](https://github.com/sassoftware/sas-viya-programming/blob/master/python/hands-on-workshop/Part%201%20-%20Crash%20Course%20in%20Pandas.ipynb), so the function that we should use is `read_csv`.

| Column      | Description                                                                 |
|-------------|-----------------------------------------------------------------------------|
| Make        | The manufacturer or brand of the car (e.g., Ford, Toyota, BMW).             |
| Model       | The specific model name of the car.                                         |
| Type        | The category of the vehicle (e.g., Sedan, SUV, Truck, Sports).              |
| Origin      | The region or country where the car was manufactured (e.g., USA, Asia).     |
| DriveTrain  | The drivetrain system, such as FWD (Front Wheel Drive), RWD, or AWD/4WD.    |
| MSRP        | Manufacturer’s Suggested Retail Price — the sticker price of the car.       |
| Invoice     | The dealer’s invoice price — the price the dealer pays the manufacturer.    |
| EngineSize  | The size of the engine in liters (e.g., 2.0, 3.5).                          |
| Cylinders   | The number of cylinders in the engine (e.g., 4, 6, 8).                      |
| Horsepower  | The engine’s power output, measured in horsepower.                          |
| MPG_City    | Miles per gallon the car achieves in city driving.                          |
| MPG_Highway | Miles per gallon the car achieves in highway driving.                       |
| Weight      | The weight of the car in pounds.                                            |
| Wheelbase   | The distance between the front and rear axles, in inches.                   |
| Length      | The total length of the car, in inches.                                     |


In [None]:
# Using an external source
data_path = "https://raw.githubusercontent.com/sassoftware/sas-viya-programming/master/data/cars.csv"
df = pd.read_csv(data_path)
df
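
`read_csv` is not limited to URLs: it also accepts local file paths and file-like objects. As a sketch that runs without any download, the example below reads a tiny CSV held in a string through `io.StringIO`. The contents of `csv_text` are illustrative values, not part of the cars dataset.

```python
import io

import pandas as pd

# A small CSV document held in a string instead of a file (illustrative data)
csv_text = """fruit,quantity
apples,3
bananas,5
oranges,2
"""

# StringIO wraps the string so that read_csv can treat it like a file
fruits_df = pd.read_csv(io.StringIO(csv_text))

print(fruits_df)
print(fruits_df.dtypes)
```

The first line of the CSV becomes the column names, and Pandas infers the data type of each column from its values.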

#### Exercise 9

Create a simple file with 2 columns, "Name" and "Age", and fill those with various values. Then upload the file to Colab and read it using Pandas.

In [None]:
# Write code here

Sometimes we just want to have a quick look at the DataFrame. For that, we can use:
- The `head` function, to get the first N rows of the DataFrame
- The `tail` function, to get the last N rows of the DataFrame

By default the value of N is 5.

In [None]:
data_path = "https://raw.githubusercontent.com/sassoftware/sas-viya-programming/master/data/cars.csv"
df = pd.read_csv(data_path)

In [None]:
df.head()

In [None]:
df.tail()

### General Info

We can also use the following properties to get more information about the DataFrame:
- `columns` -> the column names (stored in a Pandas `Index` object)
- `dtypes` -> Series with the data types of the columns

In [None]:
df.columns

In [None]:
df.dtypes

We can also use the `info` function to print a concise summary of the DataFrame.

In [None]:
df.info()

### Indexes and Columns

We can select specific columns from a DataFrame either to create a smaller DataFrame with fewer columns, or to extract a single column as a Series. This is done using Python’s indexing syntax.

- Indexing with the name of a single column returns a Series.
- Indexing with a list of column names returns a new DataFrame.

In [None]:
model = df['Model']
model.head()

In [None]:
sub_df = df[['Make', 'Model', 'Horsepower']]
sub_df.head()

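Similarly to a Series, we can also use `loc` and `iloc` to slice a DataFrame -> `[<start_row>:<end_row>:<step>, <start_col>:<end_col>:<step>]`. Note that `loc` slices by label and includes the end label, while `iloc` slices by position and excludes the end position.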

If we omit the row or column range, Pandas will default to including all rows and/or all columns. If we omit the step, Pandas will default to 1. 

In [None]:
df.loc[5:15]

In [None]:
df.iloc[5:15]
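
The two cells above return a different number of rows even though the slice looks identical. A tiny DataFrame (called `letters` here, purely for illustration) makes the reason visible.

```python
import pandas as pd

letters = pd.DataFrame({"letter": list("abcdef")})

# loc slices by label and includes the end label: rows 1, 2 and 3
by_label = letters.loc[1:3]
print(by_label)

# iloc slices by position and excludes the end position: rows 1 and 2
by_position = letters.iloc[1:3]
print(by_position)
```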

#### Exercise 10

Select only the first 4 columns of the `df` DataFrame.

In [None]:
# Write code here

#### Exercise 11

Select only the even indexes from the `df` DataFrame.

**Hint:** We can use `step` to achieve this.

In [None]:
# Write code here

#### Exercise 12

Select from the `df` DataFrame:
- the rows 10 to 20
- the range of columns from "Make" to "Type"

In [None]:
# Write code here

We can also change the index column by using the `set_index` function.

In [None]:
df.set_index('Model')

An important thing to note is that the `set_index` function does not change the DataFrame in place by default. Instead, it returns a new DataFrame with the specified column set as the index.

If we don’t assign this result to a variable, the new DataFrame will be discarded.

In [None]:
df.head()

To change the index of the original DataFrame, we need to set the parameter `inplace` to `True`, which modifies the DataFrame itself instead of returning a new one.

In [None]:
df.set_index('Model', inplace=True)
df.head()

To reverse this, we can simply call the `reset_index` function.

In [None]:
df.reset_index(inplace=True)
df.head()
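
An alternative to `inplace=True` is to assign the returned DataFrame back to the same variable. The sketch below shows this pattern on a small illustrative DataFrame called `people`, so it does not alter `df`.

```python
import pandas as pd

people = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]})

# Reassigning the returned DataFrame keeps the new index
people = people.set_index("name")
alice_age = people.loc["Alice", "age"]
print(alice_age)              # 25

# reset_index moves the index back into a regular column
people = people.reset_index()
print(list(people.columns))   # ['name', 'age']
```

Both styles give the same result. Reassignment has the advantage that the same pattern works for every method that returns a new DataFrame.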

#### Exercise 13

From `df` get all the instances where the type is "Truck".

In [None]:
# Write code here

Finally, we can also use the `unique` function to get the unique values of a certain column.

In [None]:
df["Make"].unique()

### Boolean Indexing and Filtering

A more dynamic way to index DataFrames is through boolean indexing. Instead of specifying explicit index values, we use an expression to indicate which rows to select. 

This expression generates a boolean Series, where `True` marks the rows to keep and `False` marks the rows to exclude.

In [None]:
df["MSRP"] > 40000

We can now filter our DataFrame by only showing the rows where the condition is true.

In [None]:
df[df["MSRP"] > 40000].head()

You can also combine conditions using `&` for "and" and `|` for "or".

In [None]:
df[(df["MSRP"] > 40000) & (df["Cylinders"] > 8)].head()
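
The parentheses around each condition are required, because `&` and `|` are evaluated before comparisons such as `>` or `==`. Here is a sketch of the "or" case on a small illustrative DataFrame, together with the `~` operator, which negates a condition.

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "city": ["New York", "Paris", "London"],
})

# "or": keep rows that satisfy at least one condition
young_or_london = people[(people["age"] < 30) | (people["city"] == "London")]
print(young_or_london)

# "not": ~ flips True and False in the boolean Series
not_paris = people[~(people["city"] == "Paris")]
print(not_paris)
```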

#### Exercise 14

Create a DataFrame called `eu_sedan` that contains only sedans from Europe.

Create a DataFrame called `asia_suv` that contains only SUVs from Asia.

In [None]:
# Write code here

In [None]:
asia_suv = df[(df["Origin"] == "Asia") & (df["Type"] == "SUV")]
asia_suv

### Sorting

Sorting can be done according to the index or the column values. The methods used to sort a DataFrame are `sort_index`, for sorting by the index, and `sort_values`, for sorting by the data values.

In [None]:
df.sort_index().head()

In [None]:
df.sort_values(['Make', 'Horsepower']).head()

We can also choose the direction of the sort by using the `ascending` parameter.

In [None]:
# Checking the different makers of cars
df["Make"].unique()

In [None]:
# First sort by make in descending order and then, within each make, sort by horsepower in ascending order
df.sort_values(["Make", "Horsepower"], ascending=[False, True]).head()

#### Exercise 15 

Find the Audi model with the smallest engine size. If several models are tied on engine size, pick the one with the highest horsepower.

In [None]:
# Write code here

### Simple Statistics

Pandas has a very useful and simple function, called `describe`, that computes various basic statistics for the entire DataFrame.

In [None]:
df.describe()

However, it is common to transpose this table, that is, to swap the rows and the columns. To do so, we simply add `.T` to the end of the command.

In [None]:
df.describe().T

As we can see, not all columns appear. This happens because of the different data types in the DataFrame. By default, when a DataFrame contains numeric columns, `describe` summarizes only those columns, which makes sense considering that different statistics must be computed depending on the data type.

We can quickly get a table with all columns by using the `include` parameter and setting it to `"all"`.

In [None]:
df.describe(include="all").T

It is also possible to calculate individual statistics by using the appropriate functions:
- `count` -> number of non-null values
- `sum` -> sum of values
- `min` -> minimum
- `max` -> maximum
- `mean` -> mean
- `median` -> median
- `mode` -> mode
- `std` -> standard deviation

In [None]:
df.max()

However, it is always important to consider the data types of the columns. For example, if we simply call the `mean` function (in Pandas 2.0 and later), we will get an error, because Pandas will try to calculate an arithmetic mean on columns that don't contain numbers.

To solve this problem, we need to set the parameter `numeric_only` to `True`.

In [None]:
df.mean(numeric_only=True)
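
Another option is to keep only the numeric columns first and then apply the statistic. The sketch below uses `select_dtypes` on a small illustrative DataFrame; the `height_cm` values are made up for the example.

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "height_cm": [165.0, 180.0, 175.0],   # made-up values for illustration
})

# Keep only the numeric columns, then compute the mean of each one
numeric_columns = people.select_dtypes(include="number")
column_means = numeric_columns.mean()
print(column_means)
```

This approach is handy when several statistics are needed, since the numeric subset can be reused instead of passing `numeric_only=True` every time.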

#### Exercise 16

Use `.describe()` on MSRP

Use `.min()` on MSRP

Use `.describe()` on Weight and Horsepower

Use `.mean()` on Weight and Horsepower

In [None]:
# Write code here

### Grouping Data

Another common operation in data analysis is grouping data by variable values. This is primarily done using the `groupby` method of DataFrames.

In [None]:
grpdf = df.groupby("Origin")
grpdf

You'll notice that in this case the returned value is a `DataFrameGroupBy` object.  Many of the methods available on a DataFrame will also work on this object.

In [None]:
grpdf["MSRP"].describe()

In [None]:
grpdf[["MSRP", "Horsepower"]].describe()

We can also choose specific operations.

In [None]:
grpdf.mean(numeric_only=True)
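
When several statistics are needed for the same group, the `agg` method accepts a list of function names. This is a sketch on a small illustrative DataFrame called `sales`, not on the cars dataset.

```python
import pandas as pd

sales = pd.DataFrame({
    "fruit": ["apples", "bananas", "apples", "bananas"],
    "quantity": [3, 5, 7, 1],
})

# One row per group, one column per statistic
summary = sales.groupby("fruit")["quantity"].agg(["sum", "mean", "max"])
print(summary)
```

The result is a regular DataFrame indexed by the group values, so it can be sorted or filtered like any other DataFrame.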

#### Exercise 17
Identify the top three car manufacturers that produce cars with the highest median horsepower.

In [None]:
# Write code here

### Operations on columns

Sometimes it is necessary to perform operations on specific columns, using constants, variables, or even other columns.

For example, let’s imagine the dataset owner informed us that there was an error in the number of cylinders column. The correct value should be the current value plus one.

In [None]:
df["Cylinders"] + 1

However, the changes did not happen...

In [None]:
df["Cylinders"]

What we did earlier simply returned a Series with the result of the operation. To actually update the DataFrame, we need to assign the result back to the column, treating it in the same way we would update a variable.

In [None]:
df["Cylinders"] = df["Cylinders"] + 1

In [None]:
df["Cylinders"] 

To add a new column we can use a similar approach.

In [None]:
df["newCol"] = 123

In [None]:
df.head()

To drop a column, we use the `drop` function and specify the column name(s) to remove using the `columns` parameter.

In [None]:
df.drop(columns=["newCol"])

However, just like when we changed the index of our DataFrame, the drop operation does not modify the DataFrame in place by default.

In [None]:
df.head()

To actually modify the DataFrame, we need to use the `inplace=True` parameter.

⚠️ Be careful when dropping columns, as this operation cannot be undone unless you reload the data.

In [None]:
df.drop(columns=['newCol'], inplace=True)
df.head()

#### Exercise 18

Create a new column called `HighwayToCityMPG`, which is the miles per gallon on the highway divided by the miles per gallon in the city.

Then, determine if there is any model that consumes less fuel in the city than on the highway.

**Hint:** Verify if our new column is less than 1, meaning that MPG_City > MPG_Highway.

In [None]:
df["HighwayToCityMPG"] = df["MPG_Highway"] / df["MPG_City"]

df[df["HighwayToCityMPG"] < 1]

### Plotting

There are several packages for creating plots in Python.  These include [matplotlib](http://matplotlib.org), [seaborn](https://stanford.edu/~mwaskom/software/seaborn/), [bokeh](http://bokeh.pydata.org/en/latest/), [plot.ly](https://plot.ly), etc.

Many of these packages, such as seaborn and the Pandas plotting features, use matplotlib in the background. Packages like bokeh and plot.ly are primarily focused on graphics that are rendered in a web browser. These plotting packages will be explored in depth during the classes of Data Mining I.

When using pandas, some basic plotting features can be accessed in the `plot` method of the DataFrame. Let's create a scatter plot of the MSRP and horsepower values.

In [None]:
df.plot(kind='scatter', x='MSRP', y='Horsepower', figsize=(12,6))

## Copy by reference and by value

In most programming languages, there are two common ways to copy an object:

- **By Value:** A completely new and independent object is created, containing the same contents as the original. Changes made to the new object do not affect the original.
- **By Reference:** Instead of creating a new object, a new variable points to the same underlying object in memory. This means that modifications to one variable will also affect the other.


In [None]:
# Copy by reference (no actual copy, just another reference)
df_ref = df

# Changing the row with index 0 to 123
df_ref.loc[0] = 123 

df_ref.head()

In [None]:
df.head()

To avoid this behavior, we can use the `copy` method of the DataFrame to create a completely new and independent DataFrame.

In [None]:
# Proper copy by value (independent object)
df_val = df.copy()

# Changing the row with index 0 to 456
df_val.loc[0] = 456

df_val.head()

In [None]:
df.head()
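
Python's `is` operator offers a quick way to check whether two variables refer to the same object. The sketch below repeats the idea on a small illustrative DataFrame called `original`, so it can be run without changing `df`.

```python
import pandas as pd

original = pd.DataFrame({"value": [1, 2, 3]})

same_object = original          # a second name for the same DataFrame
independent = original.copy()   # a new DataFrame with the same contents

print(same_object is original)   # True
print(independent is original)   # False

# Changing the copy leaves the original untouched
independent.loc[0, "value"] = 99
print(original.loc[0, "value"])  # 1
```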

## Final Exercises

1 - Create a DataFrame by loading the data from the following URL: https://raw.githubusercontent.com/sassoftware/sas-viya-programming/master/data/cars.csv

In [None]:
# Write code here

2 - How many rows and columns are in the dataset?

In [None]:
# Write code here

3 - What are the different origins of cars?

In [None]:
# Write code here

4 - How many manufacturers exist for each origin?

**Hint:** the function `nunique` can be useful.

In [None]:
# Write code here

5 - Which manufacturer produces the third highest number of models?

In [None]:
# Write code here

6 - What is the 268th highest horsepower in the dataset?

In [None]:
# Write code here

7 - We will make some changes to the dataset. Please create an independent copy of it.

In [None]:
# Write code here

8 - Create a new column in the copied DataFrame called `ProfitMargin`, which is the difference between `MSRP` and `Invoice`.

In [None]:
# Write code here

9 - What are the models with the highest and lowest profit margins? What is the difference between those margins?

In [None]:
# Write code here

In [None]:
# Write code here

In [None]:
# Write code here

10 - Which manufacturer has, on average, the smallest engine size?

In [None]:
# Write code here

11 - I want to buy a car that meets the following criteria:

- Does not consume a lot of fuel, especially in the city.
- Has high horsepower.
- Is not very big, meaning it has a small length.
- Is affordable, meaning its MSRP is below the median MSRP.

To choose the best possible car, I want to rank each model for each criterion. For example, the 22nd best car in terms of fuel consumption might be the 48th best in terms of length and 10th in terms of horsepower.

The score of that car would then be: 22 + 48 + 10 = 80

The car with the smallest total score would be the optimal choice.

Which model should I buy based on this scoring method?

In [None]:
# Write code here

Good job! :)