
## Data Manipulation with Pandas
## Brendan Shea, PhD
Building upon the foundation laid in Chapter 1, we delve further into the world of Python and pandas in this next chapter. Having introduced the mtcars dataset and explored some of its basic features, we now focus on mastering more advanced data manipulation techniques. Python and pandas offer us the tools to filter, sort, and transform our dataset in various ways, enabling us to extract increasingly sophisticated insights from it.

In this chapter, we continue to work with the mtcars dataset, a collection of data points about various car models. Its diverse features allow us to explore a myriad of data manipulation techniques, sharpening our skills and broadening our understanding of the practical application of Python and pandas.

However, as we delve deeper into these techniques, it's not just about the how, but also the why. What's going on behind these functions and methods we're using? The answer lies in algorithms, the heart of many operations in pandas. An algorithm is like a recipe, guiding the computer step by step to achieve our intended result.

In this chapter, we will unmask the role of algorithms in our data exploration, bringing them to the forefront of our learning journey. We will highlight examples of algorithms in the operations we use to manipulate the mtcars dataset and discuss their impact on the results of our data analysis.

As we reveal the machinery of algorithmic processing, we will also reflect on the philosophical implications of using algorithms. They can be powerful and efficient tools but also carry risks such as biases. Thus, understanding algorithms extends beyond the technical: it invites us to engage with important ethical questions.

By the end of this chapter, you will have developed a deeper understanding of data manipulation techniques, gained a clearer perspective on the role and influence of algorithms, and cultivated a sense of the ethical considerations in data science. Let's continue our journey into the fascinating world of Python and pandas.

In [None]:
!pip install pydataset -q # Install required packages
from pydataset import data # Import required modules
import pandas as pd # More on this below

mtcars_df = data('mtcars') # Load the mtcars dataset

mtcars_df.head() # display first five rows


Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


## Revisiting the mtcars Dataset

The mtcars dataset is a valuable resource for our study due to its wide range of attributes relating to various aspects of automobile design and performance. Each row in the dataset represents a different car model from the 1973-74 Motor Trend magazine issues, and each column represents a different attribute, such as miles per gallon (mpg), number of cylinders (cyl), and horsepower (hp), among others.

To reacquaint ourselves with the mtcars dataset, we begin by importing the pandas library and loading the dataset. Once loaded, we can view the first few rows of the DataFrame using the `head()` function to get a quick overview of the data structure.



The mtcars dataset is composed of several columns, each representing a different feature:

-   mpg: Miles/(US) gallon
-   cyl: Number of cylinders
-   disp: Displacement (cu.in.)
-   hp: Gross horsepower
-   drat: Rear axle ratio
-   wt: Weight (1000 lbs)
-   qsec: 1/4 mile time
-   vs: Engine shape (0 = V-shaped, 1 = straight)
-   am: Transmission (0 = automatic, 1 = manual)
-   gear: Number of forward gears
-   carb: Number of carburetors

The `describe()` and `info()` functions are powerful tools for exploring the DataFrame at a more detailed level. The `describe()` function provides summary statistics for each column, such as mean, standard deviation, and quartile values, giving a mathematical overview of the dataset. On the other hand, `info()` provides a concise summary of the DataFrame, including the number of non-null entries in each column and their data types, which is particularly useful in identifying missing values and ensuring that each column contains the appropriate data type for our analysis.

Remember, understanding your data is the first step to effective data analysis. With these tools at hand, you're well equipped to begin digging deeper into the mtcars dataset.
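Since the cell below only demonstrates `describe()`, here is a minimal sketch of what `info()` reports. To keep it self-contained, it rebuilds a five-row sample from the `head()` output shown earlier rather than loading the full dataset; the name `sample_df` is just an illustrative choice.

```python
import pandas as pd

# A five-row sample copied from the head() output above (not the full dataset)
sample_df = pd.DataFrame(
    {
        "mpg": [21.0, 21.0, 22.8, 21.4, 18.7],
        "cyl": [6, 6, 4, 6, 8],
        "hp": [110, 110, 93, 110, 175],
        "wt": [2.62, 2.875, 2.32, 3.215, 3.44],
    },
    index=["Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
           "Hornet 4 Drive", "Hornet Sportabout"],
)

# Prints column names, non-null counts, and data types
sample_df.info()

# An explicit count of missing values per column
missing_per_column = sample_df.isna().sum()
print(missing_per_column)

# The data type of each column
print(sample_df.dtypes)
```

If every count of missing values is zero and each column has a numeric type, we can go ahead with arithmetic on the columns without cleaning them first.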

In [None]:
round(mtcars_df.describe(), 2)

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
count,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0
mean,20.09,6.19,230.72,146.69,3.6,3.22,17.85,0.44,0.41,3.69,2.81
std,6.03,1.79,123.94,68.56,0.53,0.98,1.79,0.5,0.5,0.74,1.62
min,10.4,4.0,71.1,52.0,2.76,1.51,14.5,0.0,0.0,3.0,1.0
25%,15.42,4.0,120.82,96.5,3.08,2.58,16.89,0.0,0.0,3.0,2.0
50%,19.2,6.0,196.3,123.0,3.7,3.32,17.71,0.0,0.0,4.0,2.0
75%,22.8,8.0,326.0,180.0,3.92,3.61,18.9,1.0,1.0,4.0,4.0
max,33.9,8.0,472.0,335.0,4.93,5.42,22.9,1.0,1.0,5.0,8.0


## Selecting and Filtering Data in pandas

The process of data analysis often involves sifting through large amounts of information to find what's most relevant to your specific task or question. This is where data selection and filtering come into play. In data analysis, **filtering** refers to the process of specifying conditions to isolate a subset of your data that meets those conditions. For example, you might be interested in analyzing only those rows where a certain value exceeds a defined threshold or those columns that represent specific variables of interest. This enables us to focus on the most pertinent data and ignore extraneous details, streamlining our analysis process and reducing computational load.

Pandas provides a wide array of functionalities for data selection and filtering, allowing us to extract precisely what we need from a DataFrame.

### Column Selection:

You can select columns simply by referring to their names. For example, you might want to select the 'mpg' (miles per gallon) column from the `mtcars` DataFrame:

In [None]:
# Selecting a single column (limit to 5 rows)
mtcars_df['mpg'].head()

Mazda RX4            21.0
Mazda RX4 Wag        21.0
Datsun 710           22.8
Hornet 4 Drive       21.4
Hornet Sportabout    18.7
Name: mpg, dtype: float64

You can also select multiple columns by passing in a list of their names:

In [None]:
# Selecting multiple columns
mtcars_df[['mpg', 'hp', 'am']].head()

Unnamed: 0,mpg,hp,am
Mazda RX4,21.0,110,1
Mazda RX4 Wag,21.0,110,1
Datsun 710,22.8,93,1
Hornet 4 Drive,21.4,110,0
Hornet Sportabout,18.7,175,0


### Row Selection

Row selection is performed using either the `iloc` or `loc` indexer. `iloc` selects rows by integer position (starting at 0), while `loc` selects rows by label (here, the car names in the index).

For example, to select the first five rows:

In [None]:
# Selecting first five rows using index-based selection
mtcars_df.iloc[0:5]


Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


We can also select rows by "label":

In [None]:
# Selecting rows by label
mtcars_df.loc["Mazda RX4"]


mpg      21.00
cyl       6.00
disp    160.00
hp      110.00
drat      3.90
wt        2.62
qsec     16.46
vs        0.00
am        1.00
gear      4.00
carb      4.00
Name: Mazda RX4, dtype: float64
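One difference between the two is easy to miss: a slice with `iloc` excludes its end position, while a slice with `loc` includes its end label. The sketch below illustrates this on a small sample rebuilt from the rows shown above (the variable names are illustrative only).

```python
import pandas as pd

# Five rows copied from the output above
sample_df = pd.DataFrame(
    {
        "mpg": [21.0, 21.0, 22.8, 21.4, 18.7],
        "cyl": [6, 6, 4, 6, 8],
        "hp": [110, 110, 93, 110, 175],
    },
    index=["Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
           "Hornet 4 Drive", "Hornet Sportabout"],
)

# Positions 0 and 1 only: the end position (2) is excluded
first_two = sample_df.iloc[0:2]

# Three rows: the end label "Datsun 710" is included
first_three = sample_df.loc["Mazda RX4":"Datsun 710"]

# Both indexers also accept a column, which returns a single value
rx4_mpg = sample_df.loc["Mazda RX4", "mpg"]   # by row label and column label
same_value = sample_df.iloc[0, 0]             # by row position and column position

print(len(first_two), len(first_three), rx4_mpg, same_value)
```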

### Conditional Filtering

Pandas allows us to filter our data based on specific conditions. For example, we might want to select cars from our mtcars DataFrame that have a miles per gallon (`mpg`) rating greater than 25:

In [None]:
# Filtering rows where mpg > 25
mtcars_df[mtcars_df['mpg'] > 25]

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Fiat 128,32.4,4,78.7,66,4.08,2.2,19.47,1,1,4,1
Honda Civic,30.4,4,75.7,52,4.93,1.615,18.52,1,1,4,2
Toyota Corolla,33.9,4,71.1,65,4.22,1.835,19.9,1,1,4,1
Fiat X1-9,27.3,4,79.0,66,4.08,1.935,18.9,1,1,4,1
Porsche 914-2,26.0,4,120.3,91,4.43,2.14,16.7,0,1,5,2
Lotus Europa,30.4,4,95.1,113,3.77,1.513,16.9,1,1,5,2


We can combine multiple conditions using the `&` (and) or `|` (or) operators. For example, to select cars with an mpg greater than 20 and an automatic transmission:

In [None]:
# Filtering rows where mpg > 20 and the car has an automatic transmission
mtcars_df[(mtcars_df['mpg'] > 20) & (mtcars_df['am'] == 0)]

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
Merc 240D,24.4,4,146.7,62,3.69,3.19,20.0,1,0,4,2
Merc 230,22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
Toyota Corona,21.5,4,120.1,97,3.7,2.465,20.01,1,0,3,1


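Remember to enclose each condition in parentheses when combining them. In Python, `&` and `|` bind more tightly than comparison operators such as `>` and `==`, so without the parentheses the expression would not be evaluated the way you intend.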
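The example above used `&`. As a complement, here is a small sketch of `|` (or) and `~` (not), again on a five-row sample copied from the earlier outputs. The variable names are illustrative.

```python
import pandas as pd

# Five rows copied from the head() output earlier in the chapter
sample_df = pd.DataFrame(
    {
        "mpg": [21.0, 21.0, 22.8, 21.4, 18.7],
        "am": [1, 1, 1, 0, 0],
    },
    index=["Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
           "Hornet 4 Drive", "Hornet Sportabout"],
)

# OR: keep a row if at least one of the conditions holds
automatic_or_efficient = sample_df[(sample_df["am"] == 0) | (sample_df["mpg"] > 22)]
print(automatic_or_efficient)

# NOT: ~ flips every True/False value in the mask
not_manual = sample_df[~(sample_df["am"] == 1)]
print(not_manual)
```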

Mastering these data selection and filtering methods in pandas will enable you to perform more targeted and efficient data analysis. They provide a way to zoom in on the most relevant parts of your data, making your analyses more precise and insightful.

## Sorting Data in pandas

Order is often an important factor in data analysis. By sorting our data, we can quickly identify patterns, outliers, or specific entries that might otherwise get lost in an unordered dataset. In pandas, we have the `sort_values()` function, which enables us to arrange our DataFrame in ascending or descending order based on the values of one or more columns.

The `sort_values()` function takes as arguments the names of the columns you want to sort by. By default, the function sorts in ascending order, but you can reverse this with the `ascending=False` option.

For example, to sort the `mtcars` DataFrame by miles per gallon (mpg) in ascending order, you would write:

In [None]:
# Sorting by 'mpg' in ascending order (limit to 5)
mtcars_df.sort_values('mpg').head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Lincoln Continental,10.4,8,460.0,215,3.0,5.424,17.82,0,0,3,4
Cadillac Fleetwood,10.4,8,472.0,205,2.93,5.25,17.98,0,0,3,4
Camaro Z28,13.3,8,350.0,245,3.73,3.84,15.41,0,0,3,4
Duster 360,14.3,8,360.0,245,3.21,3.57,15.84,0,0,3,4
Chrysler Imperial,14.7,8,440.0,230,3.23,5.345,17.42,0,0,3,4


To sort in descending order:

In [None]:
# Sorting by 'mpg' in descending order
mtcars_df.sort_values('mpg', ascending=False).head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Toyota Corolla,33.9,4,71.1,65,4.22,1.835,19.9,1,1,4,1
Fiat 128,32.4,4,78.7,66,4.08,2.2,19.47,1,1,4,1
Lotus Europa,30.4,4,95.1,113,3.77,1.513,16.9,1,1,5,2
Honda Civic,30.4,4,75.7,52,4.93,1.615,18.52,1,1,4,2
Fiat X1-9,27.3,4,79.0,66,4.08,1.935,18.9,1,1,4,1


You can also sort by multiple columns. When sorting by multiple columns, the function sorts the DataFrame based on the first column first, then sorts within those results based on the second column, and so on. For example, to sort by both the number of cylinders (`cyl`) and `mpg`, you would write:

In [None]:
# Sorting by 'cyl' and then 'mpg' in ascending order (limit to 5)
mtcars_df.sort_values(['cyl', 'mpg']).head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Volvo 142E,21.4,4,121.0,109,4.11,2.78,18.6,1,1,4,2
Toyota Corona,21.5,4,120.1,97,3.7,2.465,20.01,1,0,3,1
Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
Merc 230,22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
Merc 240D,24.4,4,146.7,62,3.69,3.19,20.0,1,0,4,2


Using the `sort_values()` function effectively can greatly enhance your ability to understand and interpret your data. By arranging your data in a meaningful order, you make it easier to find specific entries and observe patterns, outliers, or trends.
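When sorting by several columns, each column can have its own direction: `ascending` also accepts a list with one entry per column. The sketch below sorts a five-car sample (values copied from the outputs above) by `cyl` ascending and, within each cylinder group, by `mpg` descending.

```python
import pandas as pd

# Five cars copied from the outputs above
sample_df = pd.DataFrame(
    {
        "mpg": [21.4, 21.5, 18.7, 21.0, 21.4],
        "cyl": [4, 4, 8, 6, 6],
    },
    index=["Volvo 142E", "Toyota Corona", "Hornet Sportabout",
           "Mazda RX4", "Hornet 4 Drive"],
)

# cyl from low to high; within each cyl group, mpg from high to low
sorted_df = sample_df.sort_values(["cyl", "mpg"], ascending=[True, False])
print(sorted_df)
```

Note that `sort_values()` returns a new, sorted DataFrame; `sample_df` itself keeps its original order.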

## Exercises: Searching and Sorting

1.   Using the `mtcars` DataFrame, select the rows for cars that have more than 200 horsepower (`hp`).

    *Hint: Remember how we used conditional filtering to select rows based on a specific condition?*

2.  From the `mtcars` DataFrame, select the cars that weigh more than 3,000 lbs and have an automatic transmission (`am`=0). Recall that `wt` is recorded in thousands of pounds, so 3,000 lbs corresponds to `wt` > 3.

    *Hint: You can combine multiple conditions using the `&` (and) operator.*

3.  Select the columns `gear`, `carb`, and `wt` from the `mtcars` DataFrame.

    *Hint: For selecting multiple columns, you need to pass a list of column names.*

4.  Select the 5th to the 10th rows from the `mtcars` DataFrame.

    *Hint: You can use index-based selection with `iloc` for this task.*

5.  Sort the `mtcars` DataFrame in ascending order based on the `qsec` (1/4 mile time) column.

    *Hint: You can use the `sort_values()` function and specify the column you want to sort by.*

6.  Sort the `mtcars` DataFrame in descending order based on the `disp` (displacement) column.

    *Hint: To sort in descending order, set `ascending=False` in the `sort_values()` function.*

7.  Sort the `mtcars` DataFrame first by the `gear` column and then within each gear group, sort by the `carb` (number of carburetors) column.

    *Hint: To sort by multiple columns, pass a list of column names to the `sort_values()` function.*

In [None]:
# Ex 1

In [None]:
# Ex 2

In [None]:
# Ex 3

In [None]:
# Ex 4

In [None]:
# Ex 5

In [None]:
# Ex 6

In [None]:
# Ex 7

## Introduction to Functions and Algorithms

Functions are fundamental building blocks in Python and many other programming languages. Essentially, a function is a reusable piece of code that performs a specific task. It takes input in the form of arguments and returns a result based on these inputs.

Here's a simple function that multiplies a number by two:

In [None]:
def multiply_by_two(x):
    return x * 2

In this function, `x` is the input, and `x * 2` is the output. We can call this function with any numeric input we like, for example:

In [None]:
print(multiply_by_two(5))  # Outputs: 10

10


Functions are closely tied to the concept of algorithms. In the most general sense, an **algorithm** is a step-by-step procedure to perform a specific task or solve a particular problem. Therefore, you can think of functions as implementations of algorithms: they take **inputs** (or **parameters**), perform specific steps (the algorithm), and provide outputs (or **return values**). In some cases, the inputs, the outputs, or both might be **null** (that is, "nothing"; Python represents this with the special value `None`).
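To make the "null" cases concrete, here is a short sketch with two made-up functions (`greet` and `announce` are hypothetical names used only for illustration). The first takes no inputs; the second has no `return` statement, so Python returns `None` on its behalf.

```python
def greet():
    # No parameters: this function needs no input to do its job
    return "Hello, mtcars!"


def announce(model):
    # No return statement: the function prints, then implicitly returns None
    print(f"Now analyzing: {model}")


message = greet()
result = announce("Mazda RX4")

print(message)   # Hello, mtcars!
print(result)    # None
```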

For instance, imagine we want to calculate the power-to-weight ratio for cars in the `mtcars` DataFrame. The power-to-weight ratio is a measure of performance and is calculated as horsepower divided by weight. We can create a function to perform this calculation:

In [None]:
def power_to_weight(hp, wt):
    return hp / wt

With this function, we can calculate the power-to-weight ratio for any car, given its horsepower and weight:

In [None]:
print(power_to_weight(110, 2.875))  # Outputs: 38.26 (approximately)

38.26086956521739


This function, `power_to_weight`, is an implementation of an algorithm - it takes two inputs (horsepower and weight), performs a specific calculation (division), and returns a result (the power-to-weight ratio).

Now, let's briefly look at a few more algorithms.

### Algorithm: Calculating Car Efficiency

When evaluating a car, one common measure is the fuel efficiency. We can define an algorithm to compute a fuel efficiency score by taking into account miles per gallon (mpg) and the number of cylinders (cyl). The more mpg and fewer cylinders, the more efficient a car is considered to be.

In [None]:
def car_efficiency(mpg, cyl):
    # Compute the efficiency score as mpg divided by the number of cylinders
    return mpg / cyl

### Algorithm: Analyzing Car Performance

Another aspect to consider when evaluating a car is its performance. For this, we can create a performance score based on horsepower (hp), quarter mile time (qsec), and weight (wt). More horsepower and less quarter mile time indicate better performance, but we'll normalize by weight to account for the effect of the car's mass.

In [None]:
def car_performance(hp, qsec, wt):
    # Compute the performance score as horsepower divided by quarter mile time, normalized by weight
    return (hp / qsec) / wt
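As a quick check, we can call both scoring functions with the Mazda RX4's values from the table at the start of the chapter (mpg 21.0, 6 cylinders, 110 hp, qsec 16.46, wt 2.62). The functions are repeated here so the sketch runs on its own.

```python
def car_efficiency(mpg, cyl):
    # Efficiency score: mpg divided by the number of cylinders
    return mpg / cyl


def car_performance(hp, qsec, wt):
    # Performance score: horsepower per second of quarter mile time, per 1000 lbs
    return (hp / qsec) / wt


rx4_efficiency = car_efficiency(21.0, 6)
rx4_performance = car_performance(110, 16.46, 2.62)

print(round(rx4_efficiency, 2))    # 3.5
print(round(rx4_performance, 2))   # 2.55
```

Keep in mind that these scores are our own constructions for this chapter, not standard automotive measures.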

### Algorithm: Categorizing Car Type

Let's imagine we want to categorize the cars in our dataset. To do so, we can define an algorithm that examines several features of each car and assigns it to one of the categories based on its characteristics.

Consider three categories: 'Sports Car', 'Grocery Getter', and 'Roadster'. We define the categories based on the following criteria:

1. 'Sports Car': Cars with a manual transmission (am=1), more than 4 cylinders (cyl>4), and horsepower greater than 150 (hp>150). These are cars built for speed and power.

2. 'Grocery Getter': Cars with high miles per gallon (mpg>20) and less horsepower (hp < 100). These cars are optimized for efficiency and practicality.

3. 'Roadster': The remaining cars that do not fit into the other categories. These could be cars that balance performance and efficiency, offering a versatile driving experience.

Here is how we can implement this categorization algorithm in Python:

In [None]:
def categorize_car(cyl, am, hp, mpg):
    # Check if the car meets the criteria to be considered a sports car
    if cyl > 4 and am == 1 and hp > 150:
        return 'Sports Car'
    # Check if the car meets the criteria to be considered a grocery getter
    elif mpg > 20 and hp < 100:
        return 'Grocery Getter'
    # All remaining cars are considered roadsters
    else:
        return 'Roadster'
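To see the algorithm at work, we can try it on a few individual cars. The first two calls use values from the table at the start of the chapter; the third car is hypothetical, with values made up to trigger the 'Sports Car' rule. The function is repeated so the sketch runs on its own.

```python
def categorize_car(cyl, am, hp, mpg):
    # Rules are checked in order; the first one that matches decides the category
    if cyl > 4 and am == 1 and hp > 150:
        return 'Sports Car'
    elif mpg > 20 and hp < 100:
        return 'Grocery Getter'
    else:
        return 'Roadster'


rx4_type = categorize_car(cyl=6, am=1, hp=110, mpg=21.0)      # Mazda RX4
datsun_type = categorize_car(cyl=4, am=1, hp=93, mpg=22.8)    # Datsun 710
made_up_type = categorize_car(cyl=8, am=1, hp=300, mpg=14.0)  # hypothetical car

print(rx4_type)       # Roadster
print(datsun_type)    # Grocery Getter
print(made_up_type)   # Sports Car
```

Because the rules are checked from top to bottom, their order matters: a car that satisfied more than one rule would receive the first matching category.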


## Creating New Columns
A pandas DataFrame provides a straightforward way to create new columns based on operations on existing columns. Let's explore how to do this.

If we want to create a new column that is the 'mpg' (miles per gallon) column multiplied by 2, we could do this:

In [None]:
mtcars_df['double_mpg'] = mtcars_df['mpg'] * 2
mtcars_df.head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,double_mpg
Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4,42.0
Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4,42.0
Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1,45.6
Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1,42.8
Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2,37.4


Here's what's happening:

-   `mtcars_df['mpg']` selects the 'mpg' column of our DataFrame.
-   `* 2` multiplies every value in the 'mpg' column by 2.
-   `mtcars_df['double_mpg'] =` creates a new column in our DataFrame called 'double_mpg' that stores the results.

After running this code, every value in the 'double_mpg' column will be twice the corresponding value in the 'mpg' column.

You can do the same for division. For example, to create a new column that is the ratio of 'hp' (horsepower) to 'mpg', you could do this:

In [None]:
mtcars_df['hp_to_mpg_ratio'] = mtcars_df['hp'] / mtcars_df['mpg']
mtcars_df.head()


Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,double_mpg,hp_to_mpg_ratio
Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4,42.0,5.238095
Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4,42.0,5.238095
Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1,45.6,4.078947
Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1,42.8,5.140187
Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2,37.4,9.358289


In this line of code, every value in the 'hp' column is divided by the corresponding value in the 'mpg' column, and the result is stored in a new column named 'hp_to_mpg_ratio'.
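Assigning with `mtcars_df['new_column'] = ...` changes the DataFrame in place. If you would rather leave the original untouched, pandas also offers the `assign()` method, which returns a new DataFrame with the extra column. Here is a minimal sketch on a three-row sample (values copied from the output above; the variable names are illustrative).

```python
import pandas as pd

# Three rows copied from the output above
sample_df = pd.DataFrame(
    {"mpg": [21.0, 22.8, 18.7], "hp": [110, 93, 175]},
    index=["Mazda RX4", "Datsun 710", "Hornet Sportabout"],
)

# assign() returns a copy with the new column; sample_df is not modified
with_ratio = sample_df.assign(hp_to_mpg_ratio=sample_df["hp"] / sample_df["mpg"])

print(with_ratio)
print(list(sample_df.columns))   # ['mpg', 'hp']
```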

### Using Functions to Create New Columns
We've written a few custom functions to calculate car efficiency, power-to-weight ratio, and car performance, and to categorize car type. These are a bit more complex than simply multiplying or dividing columns. However, we can still use them to create new columns in our DataFrame. For the three functions that only perform arithmetic, we just have to call the function and pass in the appropriate column(s) as arguments.

Here's how we can do this:

**Car Efficiency:** We defined this as miles per gallon divided by the number of cylinders. We can create a new 'efficiency' column using our `car_efficiency()` function and the 'mpg' and 'cyl' columns:

In [None]:
mtcars_df['efficiency'] = car_efficiency(mtcars_df['mpg'], mtcars_df['cyl'])
mtcars_df.head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,double_mpg,hp_to_mpg_ratio,efficiency
Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4,42.0,5.238095,3.5
Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4,42.0,5.238095,3.5
Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1,45.6,4.078947,5.7
Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1,42.8,5.140187,3.566667
Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2,37.4,9.358289,2.3375


**Power-to-Weight Ratio:** This is the horsepower divided by the weight. We can create a new 'power_to_weight' column using our `power_to_weight()` function and the 'hp' and 'wt' columns:

In [None]:
mtcars_df['power_to_weight'] = power_to_weight(mtcars_df['hp'], mtcars_df['wt'])
mtcars_df.head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,double_mpg,hp_to_mpg_ratio,efficiency,power_to_weight
Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4,42.0,5.238095,3.5,41.984733
Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4,42.0,5.238095,3.5,38.26087
Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1,45.6,4.078947,5.7,40.086207
Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1,42.8,5.140187,3.566667,34.214619
Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2,37.4,9.358289,2.3375,50.872093


**Car Performance:** We defined this as horsepower divided by the product of quarter mile time and weight. We can create a new 'performance' column using our `car_performance()` function and the 'hp', 'qsec', and 'wt' columns:

In [None]:
mtcars_df['performance'] = car_performance(mtcars_df['hp'], mtcars_df['qsec'], mtcars_df['wt'])

mtcars_df.head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,double_mpg,hp_to_mpg_ratio,efficiency,power_to_weight,performance
Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4,42.0,5.238095,3.5,41.984733,2.550713
Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4,42.0,5.238095,3.5,38.26087,2.247995
Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1,45.6,4.078947,5.7,40.086207,2.154014
Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1,42.8,5.140187,3.566667,34.214619,1.760011
Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2,37.4,9.358289,2.3375,50.872093,2.98896


**Car Type:** We categorized cars as 'Sports Car', 'Grocery Getter', or 'Roadster' based on their cylinders, transmission, horsepower, and mpg. We can create a new 'car_type' column using our `categorize_car()` function and the appropriate columns.

Unlike our earlier functions, `categorize_car()` uses `if`/`elif` tests, which work on one value at a time rather than on a whole column, so we cannot simply pass in the columns. Instead, we need to use `lambda` (which allows us to define short unnamed functions) and `df.apply()` (which applies a function to each row or column of a DataFrame). While you don't need to know the details of these (yet), they are powerful tools when working with more complex functions and algorithms.

In [None]:
mtcars_df['car_type'] = mtcars_df.apply(lambda x: categorize_car(x['cyl'], x['am'], x['hp'], x['mpg']), axis=1)
mtcars_df.head()


Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,double_mpg,hp_to_mpg_ratio,efficiency,power_to_weight,performance,car_type
Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4,42.0,5.238095,3.5,41.984733,2.550713,Roadster
Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4,42.0,5.238095,3.5,38.26087,2.247995,Roadster
Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1,45.6,4.078947,5.7,40.086207,2.154014,Grocery Getter
Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1,42.8,5.140187,3.566667,34.214619,1.760011,Roadster
Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2,37.4,9.358289,2.3375,50.872093,2.98896,Roadster


Here's what the code does:

-   The `lambda x:` part defines a new temporary function with `x` as an argument.
-   `x` in this context is a row of the DataFrame (because we specified `axis=1` in `apply()`).
-   `categorize_car(x['cyl'], x['am'], x['hp'], x['mpg'])` is the function we want to apply to each row.

So for each row in `mtcars_df`, we're applying `categorize_car()` to the 'cyl', 'am', 'hp', and 'mpg' columns of that row, and the result is stored in the new 'car_type' column.
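For comparison, the same categorization can be expressed without a row-by-row `apply()`, using NumPy's `np.select`, which takes a list of conditions and a matching list of labels and uses the first condition that holds for each row. This is an alternative sketch, not the approach used in this chapter, and it is shown on a five-row sample copied from the output above.

```python
import numpy as np
import pandas as pd


def categorize_car(cyl, am, hp, mpg):
    # Same function as defined earlier in the chapter
    if cyl > 4 and am == 1 and hp > 150:
        return 'Sports Car'
    elif mpg > 20 and hp < 100:
        return 'Grocery Getter'
    else:
        return 'Roadster'


sample_df = pd.DataFrame(
    {
        "mpg": [21.0, 21.0, 22.8, 21.4, 18.7],
        "cyl": [6, 6, 4, 6, 8],
        "hp": [110, 110, 93, 110, 175],
        "am": [1, 1, 1, 0, 0],
    },
    index=["Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
           "Hornet 4 Drive", "Hornet Sportabout"],
)

# Row-by-row version, as in the chapter
by_apply = sample_df.apply(
    lambda row: categorize_car(row["cyl"], row["am"], row["hp"], row["mpg"]),
    axis=1,
)

# Column-wise version: conditions are checked in order, like if/elif
conditions = [
    (sample_df["cyl"] > 4) & (sample_df["am"] == 1) & (sample_df["hp"] > 150),
    (sample_df["mpg"] > 20) & (sample_df["hp"] < 100),
]
labels = ["Sports Car", "Grocery Getter"]
by_select = pd.Series(
    np.select(conditions, labels, default="Roadster"),
    index=sample_df.index,
)

print(by_select)
print(by_apply.tolist() == by_select.tolist())   # True
```

On a dataset as small as mtcars either version is fine; the column-wise form tends to matter more as the number of rows grows.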

## Exercise: Create New Column

In this exercise, you will create a new column in the `mtcars_df` DataFrame, which calculates the 'power to displacement' ratio for each car. This is defined as the horsepower (hp) divided by the engine displacement (disp).

Step 1: Write a Python function named `power_to_displacement()` that accepts two arguments 'hp' and 'disp'. The function should return the ratio of hp to disp.

Remember, a Python function is defined using the `def` keyword, followed by the function name and parentheses `()`. Inside the parentheses, we include any input parameters the function will use. The function body, which includes the operations we want the function to carry out, is indented under the function definition and typically ends with a `return` statement that specifies the output of the function.

Step 2: Create a new column in the `mtcars_df` DataFrame called 'power_to_disp'. To populate this column, apply your `power_to_displacement()` function to the 'hp' and 'disp' columns in the DataFrame.

Remember, to apply a function to a column in a DataFrame, we can simply call the function and pass the column as an argument, like so: `DataFrame['new_column'] = function(DataFrame['existing_column1'], DataFrame['existing_column2'])`.

Hint: Be sure to check the new column in your DataFrame to ensure it was created correctly. You can do this by using `DataFrame.head()`, which will show you the first few rows of the DataFrame, including your new column.



In [None]:
# Step 1: Write a Python function named power_to_displacement()

In [None]:
# Step 2: Create a new column in the mtcars_df DataFrame called 'power_to_disp'.

In [None]:
# Now, view the head of the dataframe

## Case Study: The Central Role of Algorithms in Data Science

Data science is sometimes defined as the discipline of making data useful. At its core, this involves extracting meaningful insights from raw data, which requires more than just statistical analysis or visualization techniques. It requires a critical tool in the data scientist's toolbox - the algorithm.

An **algorithm** is a step-by-step procedure or a set of rules for performing a specific task. In the realm of computer science and programming, an algorithm is often understood as a self-contained sequence of actions to be performed, which can receive zero or more inputs and produce an output. Algorithms describe the solution to a problem in terms of the data needed to represent the problem instance and the set of steps necessary to produce the intended result.

Algorithms play a pivotal role in turning data into knowledge. They offer a methodical and systematic way to process, analyze, and draw conclusions from data. From simple calculations to advanced machine learning models, algorithms are the backbone of data-driven decision-making. Understanding what an algorithm is, how it works, and how to create one is fundamental to being effective in data science.

### Examples of Algorithms

The importance of algorithms transcends industries, and the automotive industry (the topic of this chapter's data set) is no exception. Algorithms are integral to various aspects of the automotive world, from optimizing engine performance and fuel efficiency, to powering the advanced systems behind self-driving cars. Below are some examples:

- Fuel Efficiency Algorithm: An example from our previous exercises, the `car_efficiency()` function, can be seen as an algorithm. It takes the miles per gallon (mpg) and the number of cylinders (cyl) as inputs and calculates the efficiency of a car as the ratio of mpg to cyl.

- Car Type Classification Algorithm: The `categorize_car()` function uses certain characteristics of a car (like the number of cylinders, the type of transmission, horsepower, and miles per gallon) to classify it into a specific type - a 'Sports Car', 'Grocery Getter', or 'Roadster'.

- Self-Driving Car Algorithms: Modern self-driving cars are powered by complex algorithms that take inputs from multiple sensors (like cameras, lidar, and radar) and use these to make decisions about steering, acceleration, and braking. These algorithms use advanced techniques from machine learning and artificial intelligence and can adapt to new situations and learn from experience.

- Engine Control Unit (ECU) Algorithms: Every modern car has an ECU, a small computer that controls the engine, including the fuel mix, the ignition timing, and variable valve timing. These control algorithms optimize the engine's performance, fuel economy, and emissions.

### When is it Not an Algorithm?

Not everything related to cars (or life) involves algorithms. Here are some examples:

- Car Specifications: Specifications or features of a car like its color, dimensions, weight, etc., are not algorithms. They are static information about the car and do not involve any step-by-step procedures or operations.

- Driving: While driving involves a sequence of actions, it's typically not considered an algorithm in the computer science sense because human decision-making can be inconsistent and unpredictable. However, the process of automating driving has involved translating aspects of it into algorithms for self-driving cars.

- Car Manufacturing: The manufacturing process of a car involves a sequence of operations. However, it's more of a process or a procedure than a computer algorithm. Although parts of it could be automated and controlled by algorithms, the process as a whole isn't one.

By understanding the importance and role of algorithms in data science and the automotive industry, we can better appreciate their fundamental role in shaping the progress of technology and society. With this knowledge, we are better equipped to leverage these tools and contribute to the future of these fields.

### Designing Algorithms: From Simplicity to Complexity

The process of designing algorithms is a pivotal part of computational programming and data science. It involves translating a real-world problem into a step-by-step computational procedure. The complexity of algorithms can vary widely, from simple arithmetic operations to highly sophisticated machine learning models. Let's trace this progression, from the simplest algorithms to the most complex ones.

**Simple Algorithms: The Foundation.** At their most basic, algorithms are systematic procedures for calculations. A simple algorithm might just involve applying a basic mathematical operation. For example, an algorithm to calculate the average speed of a car could involve a single step: divide the distance traveled by the time taken. Such an algorithm is straightforward, based on a clear mathematical formula.
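In Python, this one-step algorithm might be sketched as follows. The function name, parameter names, and trip values are made up for illustration.

```python
def average_speed(distance_miles, time_hours):
    # Single step: divide the distance traveled by the time taken
    return distance_miles / time_hours


trip_speed = average_speed(120, 2)
print(trip_speed)   # 60.0 (miles per hour)
```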

**Pseudocode: Bridging the Gap.**  As algorithms become more complex, it is often beneficial to first represent them as pseudocode, a language-neutral way of expressing the algorithm. Pseudocode focuses more on the logic of the algorithm rather than the syntactical details of a specific programming language.

For instance, an algorithm to calculate fuel efficiency might look like this in pseudocode:

```
Procedure CalculateFuelEfficiency is
    Input: list of distances (in miles) and list of fuel consumed (in gallons) for each trip
    Output: list of fuel efficiencies for each trip (in MPG)

    for each trip in the list of trips do
       calculate fuel efficiency as distance/fuel
       store the fuel efficiency
    end for

    return the list of fuel efficiencies
end Procedure
```
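
One possible Python translation of this pseudocode is sketched below. The function name `calculate_fuel_efficiency` and the sample trip values are made up for illustration, and the sketch assumes the two lists line up trip by trip.

```python
def calculate_fuel_efficiency(distances, fuel_used):
    """Return the MPG for each trip, given miles and gallons per trip."""
    if len(distances) != len(fuel_used):
        raise ValueError("distances and fuel_used must have the same length")

    efficiencies = []
    # zip pairs the first distance with the first fuel amount, and so on.
    for distance, fuel in zip(distances, fuel_used):
        efficiencies.append(distance / fuel)
    return efficiencies


# Three hypothetical trips: miles driven and gallons consumed.
trip_mpg = calculate_fuel_efficiency([120, 300, 50], [4, 12, 2])
print(trip_mpg)  # [30.0, 25.0, 25.0]
```

Notice how closely the Python follows the pseudocode: the loop, the division, and the stored result each map to one line.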

**Algorithms as Models: Representing the Real World.** Algorithms in data science often function as **models** -- approximations or simplifications of reality that highlight specific relationships between variables. These algorithms take data as inputs, apply a series of computations, and produce outputs like predictions or classifications.

For example, a linear regression algorithm models the relationship between two variables as a straight line. The simplicity of this model makes it easy to interpret, but it might not capture more complex relationships in the data. Hence, the choice of an algorithm or model depends on the task at hand, the complexity of the data, and the trade-off between simplicity and accuracy that the data scientist is willing to make.
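
To make the straight-line idea concrete, here is a minimal sketch of ordinary least squares for a single predictor, written in plain Python. The function name `fit_line` and the weight and MPG numbers are invented for illustration; in practice a library such as NumPy or scikit-learn would usually do this work.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how much x varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: car weight (in 1000 lbs) and fuel efficiency (in MPG).
weights = [2.0, 2.5, 3.0, 3.5, 4.0]
mpgs = [30.0, 27.0, 23.0, 20.0, 17.0]

slope, intercept = fit_line(weights, mpgs)
print(round(slope, 2), round(intercept, 2))  # roughly -6.6 and 43.2

# Use the fitted line to predict MPG for a hypothetical 3,200 lb car.
predicted_mpg = slope * 3.2 + intercept
print(round(predicted_mpg, 1))
```

The whole model is just two numbers, which is why it is easy to interpret: in this made-up data, each extra 1,000 lbs costs about 6.6 MPG. The same simplicity means it cannot represent a curved relationship.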

**Complex Algorithms: Mimicking Human Intelligence.** On the frontier of technology, algorithms reach a level of complexity that attempts to mimic human intelligence (**artificial intelligence**). Consider self-driving cars: they operate using a multitude of algorithms that take data from various sensors, make sense of the surrounding environment, make decisions based on this interpretation, and act on these decisions.

One key technology in self-driving cars is the **LiDAR (Light Detection and Ranging)** system, which measures distances by emitting laser pulses and timing their reflections. The algorithms that turn these measurements into a 3D map of the environment are complex and draw on knowledge from domains such as physics, signal processing, and machine learning.
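
The simplest building block of LiDAR processing can be sketched in a few lines: a laser pulse travels to an object and back, so the distance is the speed of light times the round-trip time, divided by two. The function name `distance_from_echo` is hypothetical, and a real system involves far more than this single step (noise filtering, many pulses per second, and combining points into a map).

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # meters per second


def distance_from_echo(round_trip_seconds):
    """Estimate distance (in meters) to an object from a pulse's round-trip time."""
    # The pulse travels out and back, so halve the total path length.
    return SPEED_OF_LIGHT_M_PER_S * round_trip_seconds / 2


# A pulse that returns after 200 nanoseconds hit something about 30 meters away.
print(round(distance_from_echo(200e-9), 2))
```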

Designing algorithms to accomplish tasks typically performed by humans involves a number of challenges. Human tasks involve a high degree of complexity and nuance that is difficult to capture in an algorithm. Additionally, algorithms need to handle a variety of real-world uncertainties and edge cases. Despite these challenges, there has been significant progress in algorithm design, particularly with the advent of advanced machine learning and AI techniques.

### Algorithmic Bias: What, Why, and How?

Algorithmic bias refers to systematic and repeatable errors in the output of an algorithm that produce unfair results or outcomes. It is often a reflection of existing biases in society, and can be particularly problematic when algorithms are used in decision-making processes that affect people's lives. For example, if an algorithm is used to determine insurance premiums and it systematically charges higher premiums to certain groups of people, it may be considered biased.

Let's look at why algorithmic bias is a significant problem, some examples from the automobile industry, and the main sources of such bias.

**Why is Algorithmic Bias a Problem?** Algorithmic bias is a problem because algorithms are increasingly used to make decisions that impact people's lives in significant ways. When an algorithm is biased, it can perpetuate existing social biases and inequalities. It may also make decisions that are unfair or discriminatory. In the context of cars, if an algorithm used to predict the risk of accidents is biased against certain types of cars or drivers, it could lead to higher insurance premiums or even denial of coverage for those groups.

**Examples of Algorithmic Bias.** One example of potential algorithmic bias is in the algorithms used by ride-sharing companies like Uber or Lyft. These algorithms determine the price of a ride based on demand, which could lead to higher prices in certain neighborhoods. If these neighborhoods are predominantly inhabited by low-income or minority populations, it could be argued that the algorithm is biased.

Another example is in the world of self-driving cars. The algorithms used to operate these vehicles need to recognize a variety of objects in order to function safely. However, if the data used to train these algorithms contains mostly images of people with light skin tones, the algorithm might be less effective at detecting people with darker skin tones, potentially leading to dangerous situations.

### Five Main Sources of Algorithmic Bias

Understanding where algorithmic bias comes from is the first step towards mitigating it. Here are the five main sources:

-   **Pre-existing Bias:** Bias can exist in the societal norms and structures which produced the data. For example, if a traffic stop dataset contains a disproportionate number of stops involving certain racial groups, algorithms trained on this data might perpetuate that bias.

-   **Sample Bias:** This occurs when the data used to train an algorithm is not representative of the population the algorithm will be applied to. If an autonomous car's image recognition algorithm is trained on data primarily collected during daytime, it may perform poorly at night.

-   **Preparation Bias:** During data cleaning and preprocessing, decisions about handling outliers and missing values can introduce bias. For instance, removing older car models from a dataset because they are less common could cause an algorithm to perform poorly when it encounters such models.

-   **Algorithmic Bias:** This is bias that arises from the way the algorithm processes data. For example, an algorithm might give more weight to certain features or variables, causing it to be biased in favor of those features.

-   **Interpretation Bias:** Even if an algorithm is perfectly designed and implemented, bias can be introduced in the way its results are interpreted and applied.
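
Some of these sources can be checked with very simple code. As a rough sketch of a sample bias check, the snippet below compares how often each driving condition appears in a training set with how often it is expected in real use. The function name `representation_gaps`, the 90/10 split, and the expected shares are all hypothetical numbers chosen for illustration.

```python
from collections import Counter


def representation_gaps(labels, expected_share):
    """Return observed share minus expected share for each group."""
    counts = Counter(labels)
    total = len(labels)
    return {
        group: counts[group] / total - expected
        for group, expected in expected_share.items()
    }


# Hypothetical training images: 90 taken in daytime, 10 at night.
training_labels = ["day"] * 90 + ["night"] * 10

# Hypothetical share of real-world driving in each condition.
expected_share = {"day": 0.6, "night": 0.4}

gaps = representation_gaps(training_labels, expected_share)
for group, gap in gaps.items():
    # Negative gaps mean the group is underrepresented in the training data.
    print(group, round(gap, 2))
```

A check like this only flags a mismatch. Deciding what the "expected" shares should be, and what to do about a gap, still requires human judgment.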

Understanding and addressing algorithmic bias is a crucial aspect of ethical data science. It ensures that the insights and decisions derived from data are fair and equitable, which is especially important as algorithms play an increasingly prominent role in our society.


## Questions: Algorithms and Bias

1. In the process of designing an algorithm, programmers make choices based on certain assumptions. Can you think of a car-related algorithm and the potential assumptions that might be made during its design? How might these assumptions impact the function and effectiveness of the algorithm?

2. We've discussed how algorithmic bias can influence outcomes in the car industry, such as the performance of self-driving cars or the pricing model of ride-sharing apps. Can you think of other real-world situations where algorithmic bias could have significant impacts? How might these biases influence the lives of individuals or communities?

3. As we learned, there are several sources of algorithmic bias: pre-existing bias, sample bias, preparation bias, algorithmic bias, and interpretation bias. Choose one of these sources and discuss potential strategies to reduce or eliminate the bias. Consider both the technical aspects (e.g., changes to the algorithm or data) and societal aspects (e.g., policies or regulations). What might be some challenges in implementing these strategies?

## My Answers: Algorithms and Bias
1.


2.


3.

## Glossary

| Term | Definition |
| --- | --- |
| Algorithm | A specific set of instructions designed for a task. In data science, it's the blueprint for processing and analyzing data. |
| Function | A particular instance of an algorithm, typically a code block in programming. It performs a specified task and can accept inputs and produce outputs. |
| Parameter (function) | An input to a function. It's the means by which algorithms process different data or vary their behavior. |
| Return value (function) | The output of a function. It's the direct result of the algorithm's steps and can be used in further computations or as the final output. |
| Pseudocode | A high-level representation of an algorithm. It focuses on logic, ignoring syntax, thus simplifying the design and explanation of algorithms. |
| Model | A representation or simulation of a system, generated by algorithms in data science, used for understanding patterns, making predictions, or testing hypotheses. |
| LiDAR | Light Detection and Ranging. A sensing technology that measures distances by emitting laser pulses and timing their reflections, often used to create detailed spatial representations. |
| Algorithmic Bias | The skewing of an algorithm's outcomes due to biases in its data or design, resulting in unfair favoritism towards certain groups or outcomes. |
| Pre-existing Bias | Bias in the data before it's processed by an algorithm. If overlooked, it may affect the accuracy and fairness of algorithmic outcomes. |
| Sample Bias | A bias in which some elements of a population are underrepresented in the data sample. This can lead to inaccurate or skewed algorithmic outcomes. |
| Preparation Bias | Bias introduced during data preparation for an algorithm. It can result from decisions made during data cleaning, feature selection, or handling missing values, potentially skewing the algorithm's data input. |
| Interpretation Bias | Misinterpretation or overinterpretation of an algorithm's results, which can lead to incorrect conclusions or decisions. |
| Artificial Intelligence | The field focused on creating systems that emulate tasks requiring human intelligence. These tasks are typically driven by complex algorithms capable of understanding natural language, recognizing patterns, interpreting images, and making decisions. |

## Some Code to Know

| Operation in English | Code Snippet |
| --- | --- |
| Access the 'mpg' column of the dataframe 'mtcars_df'. | `mtcars_df['mpg']` |
| Retrieve the first 5 rows of the 'mpg', 'hp', and 'am' columns from the dataframe 'mtcars_df'. | `mtcars_df[['mpg', 'hp', 'am']].head()` |
| Get the first 5 rows of the dataframe 'mtcars_df' using position-based selection. | `mtcars_df.iloc[0:5]` |
| Select the row labeled 'Mazda RX4' from the dataframe 'mtcars_df'. | `mtcars_df.loc["Mazda RX4"]` |
| Find rows in the dataframe 'mtcars_df' where the 'mpg' value is greater than 25. | `mtcars_df[mtcars_df['mpg'] > 25]` |
| Find rows in 'mtcars_df' where 'mpg' is greater than 20 and 'am' equals 0. | `mtcars_df[(mtcars_df['mpg'] > 20) & (mtcars_df['am'] == 0)]` |
| Sort the dataframe 'mtcars_df' in ascending order by the 'mpg' column. | `mtcars_df.sort_values('mpg')` |
| Sort the dataframe 'mtcars_df' in descending order by the 'mpg' column. | `mtcars_df.sort_values('mpg', ascending=False)` |
| Sort the dataframe 'mtcars_df' by the 'cyl' and 'mpg' columns in ascending order and retrieve the first 5 rows. | `mtcars_df.sort_values(['cyl', 'mpg']).head()` |
| Define a function named 'multiply_by_two' that takes an argument 'x' and returns the result of 'x' multiplied by 2. | `def multiply_by_two(x): return x * 2` |
| Define a function named 'power_to_weight' that takes arguments 'hp' and 'wt' and returns the ratio of 'hp' to 'wt'. | `def power_to_weight(hp, wt): return hp / wt` |
| Create a new column in 'mtcars_df' named 'hp_to_mpg_ratio' that contains the ratio of 'hp' to 'mpg' for each row. | `mtcars_df['hp_to_mpg_ratio'] = mtcars_df['hp'] / mtcars_df['mpg']` |
| Assuming 'car_efficiency' is a function, apply it to the 'mpg' and 'cyl' columns of 'mtcars_df' and store in 'efficiency'. | `mtcars_df['efficiency'] = car_efficiency(mtcars_df['mpg'], mtcars_df['cyl'])` |
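
To try several of these snippets together without loading the full dataset, the sketch below builds a small stand-in for `mtcars_df` with a few rows typed by hand for illustration. The `car_efficiency` function shown here (MPG per cylinder) is a hypothetical definition, since the table only assumes that such a function exists.

```python
import pandas as pd

# A small hand-typed stand-in for mtcars_df (illustrative rows only).
mtcars_df = pd.DataFrame(
    {
        "mpg": [21.0, 22.8, 21.4, 18.1, 32.4],
        "cyl": [6, 4, 6, 6, 4],
        "hp": [110, 93, 110, 105, 66],
        "wt": [2.620, 2.320, 3.215, 3.460, 2.200],
        "am": [1, 1, 0, 0, 1],
    },
    index=["Mazda RX4", "Datsun 710", "Hornet 4 Drive", "Valiant", "Fiat 128"],
)


def car_efficiency(mpg, cyl):
    """Hypothetical efficiency measure: miles per gallon per cylinder."""
    return mpg / cyl


# Filter rows with a boolean condition.
high_mpg = mtcars_df[mtcars_df["mpg"] > 25]

# Combine conditions with & and wrap each one in parentheses.
over_20_and_am_0 = mtcars_df[(mtcars_df["mpg"] > 20) & (mtcars_df["am"] == 0)]

# Sort from highest to lowest mpg.
by_mpg_desc = mtcars_df.sort_values("mpg", ascending=False)

# Functions built from arithmetic work on whole columns at once.
mtcars_df["efficiency"] = car_efficiency(mtcars_df["mpg"], mtcars_df["cyl"])

print(high_mpg.index.tolist())          # ['Fiat 128']
print(over_20_and_am_0.index.tolist())  # ['Hornet 4 Drive']
print(by_mpg_desc.index[0])             # Fiat 128
print(mtcars_df.loc["Mazda RX4", "efficiency"])  # 3.5
```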