<a href="https://colab.research.google.com/github/brendanpshea/data-science/blob/main/Data_Science_02_Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Python Intermediate: Data Manipulation with Pandas
## Brendan Shea, PhD
Building upon the foundation laid in Chapter 1, we delve further into the world of Python and pandas in this next chapter. Having introduced the mtcars dataset and explored some of its basic features, we now focus on mastering more advanced data manipulation techniques. Python and pandas offer us the tools to filter, sort, and transform our dataset in various ways, enabling us to extract increasingly sophisticated insights from it.

In this chapter, we continue to work with the mtcars dataset, a collection of data points about various car models. Its diverse features allow us to explore a myriad of data manipulation techniques, sharpening our skills and broadening our understanding of the practical application of Python and pandas.

However, as we delve deeper into these techniques, it's not just about the how, but also the why. What's going on behind these functions and methods we're using? The answer lies in algorithms, the heart of many operations in pandas. An algorithm is like a recipe, guiding the computer step by step to achieve our intended result.

In this chapter, we will unmask the role of algorithms in our data exploration, bringing them to the forefront of our learning journey. We will highlight examples of algorithms in the operations we use to manipulate the mtcars dataset and discuss their impact on the results of our data analysis.

As we reveal the machinery of algorithmic processing, we will also reflect on the philosophical implications of using algorithms. They can be powerful and efficient tools but also carry risks such as biases. Thus, understanding algorithms extends beyond the technical: it invites us to engage with important ethical questions.

By the end of this chapter, you will have developed a deeper understanding of data manipulation techniques, gained a clearer perspective on the role and influence of algorithms, and cultivated a sense of the ethical considerations in data science. Let's continue our journey into the fascinating world of Python and pandas.

In [1]:
!pip install pydataset -q # Install required packages
from pydataset import data # Import required modules
import pandas as pd # More on this below

mtcars_df = data('mtcars') # Load the mtcars dataset

mtcars_df.head() # display first five rows



Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


## Revisiting the mtcars Dataset

The mtcars dataset is a valuable resource for our study because it covers a wide range of attributes relating to automobile design and performance. Each row in the dataset represents a different car model (1973-74 models) drawn from the 1974 *Motor Trend* magazine, and each column represents a different attribute, such as miles per gallon (mpg), number of cylinders (cyl), and horsepower (hp).

To reacquaint ourselves with the mtcars dataset, we began (in the cell above) by importing the pandas library and loading the dataset. Once the data is loaded, we can view the first few rows of the DataFrame with the `head()` method to get a quick overview of the data structure.



The mtcars dataset is composed of several columns, each representing a different feature:

-   mpg: Miles/(US) gallon
-   cyl: Number of cylinders
-   disp: Displacement (cu.in.)
-   hp: Gross horsepower
-   drat: Rear axle ratio
-   wt: Weight (1000 lbs)
-   qsec: 1/4 mile time
-   vs: Engine shape (0 = V-shaped, 1 = straight)
-   am: Transmission (0 = automatic, 1 = manual)
-   gear: Number of forward gears
-   carb: Number of carburetors

The `describe()` and `info()` functions are powerful tools for exploring the DataFrame at a more detailed level. The `describe()` function provides summary statistics for each column, such as mean, standard deviation, and quartile values, giving a mathematical overview of the dataset. On the other hand, `info()` provides a concise summary of the DataFrame, including the number of non-null entries in each column and their data types, which is particularly useful in identifying missing values and ensuring that each column contains the appropriate data type for our analysis.

Remember, understanding your data is the first step to effective data analysis. With these tools at hand, you're well equipped to begin digging deeper into the mtcars dataset.

In [2]:
round(mtcars_df.describe(), 2)

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
count,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0
mean,20.09,6.19,230.72,146.69,3.6,3.22,17.85,0.44,0.41,3.69,2.81
std,6.03,1.79,123.94,68.56,0.53,0.98,1.79,0.5,0.5,0.74,1.62
min,10.4,4.0,71.1,52.0,2.76,1.51,14.5,0.0,0.0,3.0,1.0
25%,15.42,4.0,120.82,96.5,3.08,2.58,16.89,0.0,0.0,3.0,2.0
50%,19.2,6.0,196.3,123.0,3.7,3.32,17.71,0.0,0.0,4.0,2.0
75%,22.8,8.0,326.0,180.0,3.92,3.61,18.9,1.0,1.0,4.0,4.0
max,33.9,8.0,472.0,335.0,4.93,5.42,22.9,1.0,1.0,5.0,8.0
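As a complement to `describe()`, here is a minimal sketch of `info()` and a direct missing-value count. It assumes a small three-row subset copied by hand from the table above; the names `sample_df`, `info_text`, and `missing_per_column` are illustrative and not part of the original notebook.

```python
import io
import pandas as pd

# Hypothetical three-row subset, copied by hand from the mtcars rows shown earlier
sample_df = pd.DataFrame(
    {"mpg": [21.0, 21.0, 22.8], "cyl": [6, 6, 4], "hp": [110, 110, 93]},
    index=["Mazda RX4", "Mazda RX4 Wag", "Datsun 710"],
)

# info() prints its summary; passing a buffer lets us capture it as text
buffer = io.StringIO()
sample_df.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)

# A direct count of missing values in each column
missing_per_column = sample_df.isna().sum()
print(missing_per_column)
```

On the full dataset, you would simply call `mtcars_df.info()`; the buffer is only used here so the summary can be inspected as a string.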


## Selecting and Filtering Data in pandas

The process of data analysis often involves sifting through large amounts of information to find what's most relevant to your specific task or question. This is where data selection and filtering come into play. In data analysis, **filtering** refers to the process of specifying conditions to isolate a subset of your data that meets those conditions. For example, you might be interested in analyzing only those rows where a certain value exceeds a defined threshold or those columns that represent specific variables of interest. This enables us to focus on the most pertinent data and ignore extraneous details, streamlining our analysis process and reducing computational load.

Pandas provides a wide array of functionalities for data selection and filtering, allowing us to extract precisely what we need from a DataFrame.

### Column Selection:

You can select columns simply by referring to their names. For example, you might want to select the 'mpg' (miles per gallon) column from the `mtcars` DataFrame:

In [3]:
# Selecting a single column (limit to 5 rows)
mtcars_df['mpg'].head()

Mazda RX4            21.0
Mazda RX4 Wag        21.0
Datsun 710           22.8
Hornet 4 Drive       21.4
Hornet Sportabout    18.7
Name: mpg, dtype: float64

You can also select multiple columns by passing in a list of their names:

In [4]:
# Selecting multiple columns
mtcars_df[['mpg', 'hp', 'am']].head()

Unnamed: 0,mpg,hp,am
Mazda RX4,21.0,110,1
Mazda RX4 Wag,21.0,110,1
Datsun 710,22.8,93,1
Hornet 4 Drive,21.4,110,0
Hornet Sportabout,18.7,175,0
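One detail worth noticing in the two cells above: a single column name returns a Series, while a list of names returns a DataFrame (even when the list holds only one name). The sketch below illustrates this on a small hand-copied subset; `sample_df` and the other variable names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical subset copied by hand from the output above
sample_df = pd.DataFrame(
    {"mpg": [21.0, 21.0, 22.8], "hp": [110, 110, 93], "am": [1, 1, 1]},
    index=["Mazda RX4", "Mazda RX4 Wag", "Datsun 710"],
)

single = sample_df["mpg"]            # one name -> Series
as_frame = sample_df[["mpg"]]        # list with one name -> DataFrame
several = sample_df[["mpg", "hp"]]   # list of names -> DataFrame

print(type(single).__name__)
print(type(as_frame).__name__)
print(several.shape)
```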


### Row Selection

Row selection is performed using either the `iloc` or the `loc` indexer. `iloc` selects rows by integer position, while `loc` selects rows by label.

For example, to select the first five rows:

In [5]:
# Selecting first five rows using index-based selection
mtcars_df.iloc[0:5]


Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


We can also select rows by "label":

In [6]:
# Selecting rows by label
mtcars_df.loc["Mazda RX4"]


mpg      21.00
cyl       6.00
disp    160.00
hp      110.00
drat      3.90
wt        2.62
qsec     16.46
vs        0.00
am        1.00
gear      4.00
carb      4.00
Name: Mazda RX4, dtype: float64
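The two indexers also differ in how they treat slices: `iloc` excludes the end position (like ordinary Python slicing), whereas `loc` includes the end label. Here is a small sketch on a hand-copied subset; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical four-row subset copied by hand from the output above
sample_df = pd.DataFrame(
    {"mpg": [21.0, 21.0, 22.8, 21.4], "hp": [110, 110, 93, 110]},
    index=["Mazda RX4", "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive"],
)

# Position-based slice: positions 0 and 1 (the end position is excluded)
by_position = sample_df.iloc[0:2]

# Label-based slice: both end labels are included
by_label = sample_df.loc["Mazda RX4":"Datsun 710"]

# loc also accepts a row label and a column label together
one_value = sample_df.loc["Datsun 710", "mpg"]

print(len(by_position), len(by_label), one_value)
```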

### Conditional Filtering

Pandas allows us to filter our data based on specific conditions. For example, we might want to select cars from our mtcars DataFrame that have a miles per gallon (`mpg`) rating greater than 25:

In [7]:
# Filtering rows where mpg > 25
mtcars_df[mtcars_df['mpg'] > 25]

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Fiat 128,32.4,4,78.7,66,4.08,2.2,19.47,1,1,4,1
Honda Civic,30.4,4,75.7,52,4.93,1.615,18.52,1,1,4,2
Toyota Corolla,33.9,4,71.1,65,4.22,1.835,19.9,1,1,4,1
Fiat X1-9,27.3,4,79.0,66,4.08,1.935,18.9,1,1,4,1
Porsche 914-2,26.0,4,120.3,91,4.43,2.14,16.7,0,1,5,2
Lotus Europa,30.4,4,95.1,113,3.77,1.513,16.9,1,1,5,2


We can combine multiple conditions using the `&` (and) or `|` (or) operators. For example, to select cars with an mpg greater than 20 and an automatic transmission:

In [8]:
# Filtering rows where mpg > 20 and the car has an automatic transmission
mtcars_df[(mtcars_df['mpg'] > 20) & (mtcars_df['am'] == 0)]

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
Merc 240D,24.4,4,146.7,62,3.69,3.19,20.0,1,0,4,2
Merc 230,22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
Toyota Corona,21.5,4,120.1,97,3.7,2.465,20.01,1,0,3,1


Remember to enclose each condition in parentheses when combining them, to ensure that each condition is properly evaluated.
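The parentheses matter because Python's `&` and `|` operators bind more tightly than comparisons such as `>` and `==`. As a sketch of the other operators, the example below uses `|` (or) and `~` (not) on a small hand-copied subset; `sample_df` and the result names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical five-row subset copied by hand from the outputs above
sample_df = pd.DataFrame(
    {
        "mpg": [21.0, 21.4, 18.7, 32.4, 24.4],
        "hp": [110, 110, 175, 66, 62],
        "am": [1, 0, 0, 1, 0],
    },
    index=["Mazda RX4", "Hornet 4 Drive", "Hornet Sportabout", "Fiat 128", "Merc 240D"],
)

# OR: keep cars that are either very economical or very powerful
thrifty_or_powerful = sample_df[(sample_df["mpg"] > 25) | (sample_df["hp"] > 150)]

# NOT: keep cars whose transmission is not manual (am != 1)
not_manual = sample_df[~(sample_df["am"] == 1)]

print(list(thrifty_or_powerful.index))
print(list(not_manual.index))
```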

Mastering these data selection and filtering methods in pandas will enable you to perform more targeted and efficient data analysis. They provide a way to zoom in on the most relevant parts of your data, making your analyses more precise and insightful.

## Sorting Data in pandas

Order is often an important factor in data analysis. By sorting our data, we can quickly identify patterns, outliers, or specific entries that might otherwise get lost in an unordered dataset. In pandas, we have the `sort_values()` function, which enables us to arrange our DataFrame in ascending or descending order based on the values of one or more columns.

The `sort_values()` function takes as arguments the names of the columns you want to sort by. By default, the function sorts in ascending order, but you can reverse this with the `ascending=False` option.

For example, to sort the `mtcars` DataFrame by miles per gallon (mpg) in ascending order, you would write:

In [9]:
# Sorting by 'mpg' in ascending order (limit to 5)
mtcars_df.sort_values('mpg').head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Lincoln Continental,10.4,8,460.0,215,3.0,5.424,17.82,0,0,3,4
Cadillac Fleetwood,10.4,8,472.0,205,2.93,5.25,17.98,0,0,3,4
Camaro Z28,13.3,8,350.0,245,3.73,3.84,15.41,0,0,3,4
Duster 360,14.3,8,360.0,245,3.21,3.57,15.84,0,0,3,4
Chrysler Imperial,14.7,8,440.0,230,3.23,5.345,17.42,0,0,3,4


To sort in descending order:

In [10]:
# Sorting by 'mpg' in descending order
mtcars_df.sort_values('mpg', ascending=False).head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Toyota Corolla,33.9,4,71.1,65,4.22,1.835,19.9,1,1,4,1
Fiat 128,32.4,4,78.7,66,4.08,2.2,19.47,1,1,4,1
Lotus Europa,30.4,4,95.1,113,3.77,1.513,16.9,1,1,5,2
Honda Civic,30.4,4,75.7,52,4.93,1.615,18.52,1,1,4,2
Fiat X1-9,27.3,4,79.0,66,4.08,1.935,18.9,1,1,4,1


You can also sort by multiple columns. When sorting by multiple columns, the function sorts the DataFrame based on the first column first, then sorts within those results based on the second column, and so on. For example, to sort by both the number of cylinders (`cyl`) and `mpg`, you would write:

In [11]:
# Sorting by 'cyl' and then 'mpg' in ascending order (limit to 5)
mtcars_df.sort_values(['cyl', 'mpg']).head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Volvo 142E,21.4,4,121.0,109,4.11,2.78,18.6,1,1,4,2
Toyota Corona,21.5,4,120.1,97,3.7,2.465,20.01,1,0,3,1
Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
Merc 230,22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
Merc 240D,24.4,4,146.7,62,3.69,3.19,20.0,1,0,4,2
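When sorting by several columns, `ascending` can also take a list with one entry per column. The sketch below sorts by `cyl` ascending and, within each cylinder group, by `mpg` descending. It uses a small hand-copied subset, and the variable names are illustrative.

```python
import pandas as pd

# Hypothetical five-row subset copied by hand from the outputs above
sample_df = pd.DataFrame(
    {"mpg": [21.0, 22.8, 21.4, 18.7, 24.4], "cyl": [6, 4, 6, 8, 4]},
    index=["Mazda RX4", "Datsun 710", "Hornet 4 Drive", "Hornet Sportabout", "Merc 240D"],
)

# cyl ascending, then mpg descending within each cyl group
mixed = sample_df.sort_values(["cyl", "mpg"], ascending=[True, False])
print(mixed)

# sort_values() returns a new DataFrame; the original order is unchanged
print(list(sample_df.index))
```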


Using the `sort_values()` function effectively can greatly enhance your ability to understand and interpret your data. By arranging your data in a meaningful order, you make it easier to find specific entries and observe patterns, outliers, or trends.

### Exercises: Searching and Sorting

1.   Using the `mtcars` DataFrame, select the rows for cars that have more than 200 horsepower (`hp`).

    *Hint: Remember how we used conditional filtering to select rows based on a specific condition?*

2.  From the `mtcars` DataFrame, select the cars that weigh more than 3,000 lbs (`wt` > 3, since `wt` is measured in units of 1,000 lbs) and have an automatic transmission (`am`=0).

    *Hint: You can combine multiple conditions using the `&` (and) operator.*

3.  Select the columns `gear`, `carb`, and `wt` from the `mtcars` DataFrame.

    *Hint: For selecting multiple columns, you need to pass a list of column names.*

4.  Select the 5th to the 10th rows from the `mtcars` DataFrame.

    *Hint: You can use index-based selection with `iloc` for this task.*

5.  Sort the `mtcars` DataFrame in ascending order based on the `qsec` (1/4 mile time) column.

    *Hint: You can use the `sort_values()` function and specify the column you want to sort by.*

6.  Sort the `mtcars` DataFrame in descending order based on the `disp` (displacement) column.

    *Hint: To sort in descending order, set `ascending=False` in the `sort_values()` function.*

7.  Sort the `mtcars` DataFrame first by the `gear` column and then within each gear group, sort by the `carb` (number of carburetors) column.

    *Hint: To sort by multiple columns, pass a list of column names to the `sort_values()` function.*

In [12]:
# Ex 1

In [13]:
# Ex 2

In [14]:
# Ex 3

In [15]:
# Ex 4

In [16]:
# Ex 5

In [17]:
# Ex 6

In [18]:
# Ex 7

## Introduction to Functions and Algorithms

Functions are fundamental building blocks in Python and many other programming languages. Essentially, a function is a reusable piece of code that performs a specific task. It takes input in the form of arguments and returns a result based on these inputs.

Here's a simple function that multiplies a number by two:

In [19]:
def multiply_by_two(x):
    return x * 2

In this function, `x` is the input, and `x * 2` is the output. We can call this function with any numeric input we like, for example:

In [20]:
print(multiply_by_two(5))  # Outputs: 10

10


Functions are closely tied to the concept of algorithms. In the most general sense, an **algorithm** is a step-by-step procedure to perform a specific task or solve a particular problem. Therefore, you can think of functions as implementations of algorithms - they take inputs, perform specific steps (the algorithm), and provide outputs.

For instance, imagine we want to calculate the power-to-weight ratio for cars in the `mtcars` DataFrame. The power-to-weight ratio is a measure of performance and is calculated as horsepower divided by weight. We can create a function to perform this calculation:

In [21]:
def power_to_weight(hp, wt):
    return hp / wt

With this function, we can calculate the power-to-weight ratio for any car, given its horsepower and weight:

In [22]:
print(power_to_weight(110, 2.875))  # Outputs: 38.26 (approximately)

38.26086956521739


This function, `power_to_weight`, is an implementation of an algorithm - it takes two inputs (horsepower and weight), performs a specific calculation (division), and returns a result (the power-to-weight ratio).

Now, let's briefly look at a few more algorithms.

### Algorithm: Calculating Car Efficiency

When evaluating a car, one common measure is the fuel efficiency. We can define an algorithm to compute a fuel efficiency score by taking into account miles per gallon (mpg) and the number of cylinders (cyl). The more mpg and fewer cylinders, the more efficient a car is considered to be.

In [23]:
def car_efficiency(mpg, cyl):
    # Compute the efficiency score as mpg divided by the number of cylinders
    return mpg / cyl

### Algorithm: Analyzing Car Performance

Another aspect to consider when evaluating a car is its performance. For this, we can create a performance score based on horsepower (hp), quarter mile time (qsec), and weight (wt). More horsepower and less quarter mile time indicate better performance, but we'll normalize by weight to account for the effect of the car's mass.

In [24]:
def car_performance(hp, qsec, wt):
    # Compute the performance score as horsepower divided by quarter mile time, normalized by weight
    return (hp / qsec) / wt
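
Both scoring functions can be checked on single cars before we use them on a whole DataFrame. The sketch below restates the two definitions so that it runs on its own, and feeds in the Mazda RX4 and Datsun 710 values from the first table; the variable names are illustrative.

```python
def car_efficiency(mpg, cyl):
    # Efficiency score: mpg divided by the number of cylinders
    return mpg / cyl

def car_performance(hp, qsec, wt):
    # Performance score: horsepower per second of quarter mile time, per 1000 lbs
    return (hp / qsec) / wt

# Mazda RX4: mpg=21.0, cyl=6, hp=110, qsec=16.46, wt=2.62
rx4_efficiency = car_efficiency(21.0, 6)
rx4_performance = car_performance(110, 16.46, 2.62)

# Datsun 710: mpg=22.8, cyl=4
datsun_efficiency = car_efficiency(22.8, 4)

print(round(rx4_efficiency, 2))
print(round(datsun_efficiency, 2))
print(round(rx4_performance, 2))
```

By this (purely illustrative) measure, the four-cylinder Datsun scores higher on efficiency than the six-cylinder Mazda.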

### Algorithm: Categorizing Car Type

Let's imagine we want to categorize the cars in our dataset. To do so, we can define an algorithm that examines several features of each car and assigns it to one of the categories based on its characteristics.

Consider three categories: 'Sports Car', 'Grocery Getter', and 'Roadster'. We define the categories based on the following criteria:

1. 'Sports Car': Cars with a manual transmission (am=1), more than 4 cylinders (cyl>4), and horsepower greater than 150 (hp>150). These are cars built for speed and power.

2. 'Grocery Getter': Cars with high miles per gallon (mpg>20) and less horsepower (hp < 100). These cars are optimized for efficiency and practicality.

3. 'Roadster': The remaining cars that do not fit into the other categories. These could be cars that balance performance and efficiency, offering a versatile driving experience.

Here is how we can implement this categorization algorithm in Python:

In [32]:
def categorize_car(cyl, am, hp, mpg):
    # Check if the car meets the criteria to be considered a sports car
    if cyl > 4 and am == 1 and hp > 150:
        return 'Sports Car'
    # Check if the car meets the criteria to be considered a grocery getter
    elif mpg > 20 and hp < 100:
        return 'Grocery Getter'
    # All remaining cars are considered roadsters
    else:
        return 'Roadster'
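
To see the rules in action, we can call the function on a few individual cars. The sketch below restates the function so that it runs on its own. The first two calls use values from the table at the start of the chapter; the third car is a hypothetical one, invented only to trigger the 'Sports Car' branch.

```python
def categorize_car(cyl, am, hp, mpg):
    # Check if the car meets the criteria to be considered a sports car
    if cyl > 4 and am == 1 and hp > 150:
        return 'Sports Car'
    # Check if the car meets the criteria to be considered a grocery getter
    elif mpg > 20 and hp < 100:
        return 'Grocery Getter'
    # All remaining cars are considered roadsters
    else:
        return 'Roadster'

mazda_type = categorize_car(cyl=6, am=1, hp=110, mpg=21.0)    # Mazda RX4
datsun_type = categorize_car(cyl=4, am=1, hp=93, mpg=22.8)    # Datsun 710
made_up_type = categorize_car(cyl=8, am=1, hp=300, mpg=15.0)  # hypothetical car

print(mazda_type, datsun_type, made_up_type, sep=" | ")
```

The conditions are checked from top to bottom, and the function returns as soon as one of them matches.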


## Creating New Columns
A pandas DataFrame provides a straightforward way to create new columns based on operations on existing columns. Let's explore how to do this.

If we want to create a new column that is the 'mpg' (miles per gallon) column multiplied by 2, we could do this:

In [26]:
mtcars_df['double_mpg'] = mtcars_df['mpg'] * 2
mtcars_df.head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,double_mpg
Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4,42.0
Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4,42.0
Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1,45.6
Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1,42.8
Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2,37.4


Here's what's happening:

-   `mtcars_df['mpg']` selects the 'mpg' column of our DataFrame.
-   `* 2` multiplies every value in the 'mpg' column by 2.
-   `mtcars_df['double_mpg'] =` creates a new column in our DataFrame called 'double_mpg' that stores the results.

After running this code, every value in the 'double_mpg' column will be twice the corresponding value in the 'mpg' column.

You can do the same for division. For example, to create a new column that is the ratio of 'hp' (horsepower) to 'mpg', you could do this:

In [27]:
mtcars_df['hp_to_mpg_ratio'] = mtcars_df['hp'] / mtcars_df['mpg']
mtcars_df.head()


Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,double_mpg,hp_to_mpg_ratio
Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4,42.0,5.238095
Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4,42.0,5.238095
Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1,45.6,4.078947
Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1,42.8,5.140187
Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2,37.4,9.358289


In this line of code, every value in the 'hp' column is divided by the corresponding value in the 'mpg' column, and the result is stored in a new column named 'hp_to_mpg_ratio'.
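The bracket assignment above changes `mtcars_df` in place. If you would rather leave the original DataFrame untouched, pandas also offers `DataFrame.assign()`, which returns a new DataFrame with the extra column. Here is a minimal sketch on a hand-copied subset; `sample_df` and `with_ratio` are illustrative names.

```python
import pandas as pd

# Hypothetical three-row subset copied by hand from the output above
sample_df = pd.DataFrame(
    {"mpg": [21.0, 22.8, 18.7], "hp": [110, 93, 175]},
    index=["Mazda RX4", "Datsun 710", "Hornet Sportabout"],
)

# assign() returns a copy with the new column; sample_df itself is not modified
with_ratio = sample_df.assign(hp_to_mpg_ratio=sample_df["hp"] / sample_df["mpg"])

print(with_ratio)
print(list(sample_df.columns))
```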

### Using Functions to Create New Columns
We've written a few custom functions to calculate car efficiency, power-to-weight ratio, and car performance, and to categorize car type. These are a bit more complex than simply multiplying or dividing columns. However, we can still use them to create new columns in our DataFrame. We just have to call the function and pass in the appropriate column(s) as arguments.

Here's how we can do this:

**Car Efficiency:** We defined this as miles per gallon divided by the number of cylinders. We can create a new 'efficiency' column using our `car_efficiency()` function and the 'mpg' and 'cyl' columns:

In [28]:
mtcars_df['efficiency'] = car_efficiency(mtcars_df['mpg'], mtcars_df['cyl'])
mtcars_df.head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,double_mpg,hp_to_mpg_ratio,efficiency
Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4,42.0,5.238095,3.5
Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4,42.0,5.238095,3.5
Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1,45.6,4.078947,5.7
Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1,42.8,5.140187,3.566667
Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2,37.4,9.358289,2.3375


**Power-to-Weight Ratio:** This is the horsepower divided by the weight. We can create a new 'power_to_weight' column using our `power_to_weight()` function and the 'hp' and 'wt' columns:

In [29]:
mtcars_df['power_to_weight'] = power_to_weight(mtcars_df['hp'], mtcars_df['wt'])
mtcars_df.head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,double_mpg,hp_to_mpg_ratio,efficiency,power_to_weight
Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4,42.0,5.238095,3.5,41.984733
Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4,42.0,5.238095,3.5,38.26087
Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1,45.6,4.078947,5.7,40.086207
Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1,42.8,5.140187,3.566667,34.214619
Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2,37.4,9.358289,2.3375,50.872093


**Car Performance:** We defined this as horsepower divided by the product of quarter mile time and weight. We can create a new 'performance' column using our `car_performance()` function and the 'hp', 'qsec', and 'wt' columns:

In [30]:
mtcars_df['performance'] = car_performance(mtcars_df['hp'], mtcars_df['qsec'], mtcars_df['wt'])
mtcars_df.head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,double_mpg,hp_to_mpg_ratio,efficiency,power_to_weight,performance
Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4,42.0,5.238095,3.5,41.984733,2.550713
Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4,42.0,5.238095,3.5,38.26087,2.247995
Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1,45.6,4.078947,5.7,40.086207,2.154014
Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1,42.8,5.140187,3.566667,34.214619,1.760011
Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2,37.4,9.358289,2.3375,50.872093,2.98896


**Car Type:** We categorized cars as 'Sports Car', 'Grocery Getter', or 'Roadster' based on their cylinders, transmission, horsepower, and mpg. We can create a new 'car_type' column using our `categorize_car()` function and the appropriate columns.

You'll note that here we need to use `lambda` (which lets us define short unnamed functions) and `df.apply()` (which lets us apply a function to each row or column). While you don't need to know the details of these (yet), they are powerful tools when working with more complex functions and algorithms.
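
Why can't we pass whole columns to `categorize_car()` the way we did with the arithmetic functions? The function uses `if` with `and`, which needs a single True or False, but a comparison on a column produces a whole Series of them. The sketch below shows the resulting error on a tiny hand-copied subset; `sample_df` and `direct_call_failed` are illustrative names.

```python
import pandas as pd

def categorize_car(cyl, am, hp, mpg):
    if cyl > 4 and am == 1 and hp > 150:
        return 'Sports Car'
    elif mpg > 20 and hp < 100:
        return 'Grocery Getter'
    else:
        return 'Roadster'

# Hypothetical two-row subset copied by hand from the first table
sample_df = pd.DataFrame(
    {"cyl": [6, 4], "am": [1, 1], "hp": [110, 93], "mpg": [21.0, 22.8]},
    index=["Mazda RX4", "Datsun 710"],
)

try:
    # Passing whole columns: `cyl > 4` is a Series, which `if ... and` cannot reduce to one bool
    categorize_car(sample_df["cyl"], sample_df["am"], sample_df["hp"], sample_df["mpg"])
    direct_call_failed = False
except ValueError as err:
    direct_call_failed = True
    print("Direct call failed:", err)
```

Applying the function one row at a time avoids this, because each row supplies single values rather than whole columns.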

In [34]:
mtcars_df['car_type'] = mtcars_df.apply(lambda x: categorize_car(x['cyl'], x['am'], x['hp'], x['mpg']), axis=1)
mtcars_df.head()


Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,double_mpg,hp_to_mpg_ratio,efficiency,power_to_weight,performance,car_type
Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4,42.0,5.238095,3.5,41.984733,2.550713,Roadster
Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4,42.0,5.238095,3.5,38.26087,2.247995,Roadster
Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1,45.6,4.078947,5.7,40.086207,2.154014,Grocery Getter
Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1,42.8,5.140187,3.566667,34.214619,1.760011,Roadster
Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2,37.4,9.358289,2.3375,50.872093,2.98896,Roadster


Here's what the code does:

-   The `lambda x:` part defines a new temporary function with `x` as an argument.
-   `x` in this context is a row of the DataFrame (because we specified `axis=1` in `apply()`).
-   `categorize_car(x['cyl'], x['am'], x['hp'], x['mpg'])` is the function we want to apply to each row.

So for each row in `mtcars_df`, we're applying `categorize_car()` to the 'cyl', 'am', 'hp', and 'mpg' columns of that row, and the result is stored in the new 'car_type' column.
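
For larger datasets, a row-by-row `apply()` can be slow. One possible alternative (not used in the cells above) is NumPy's `np.select()`, which takes a list of conditions and a matching list of labels and picks the first condition that holds for each row. The sketch below compares the two approaches on a small subset; the last row is a hypothetical car, and all variable names are illustrative.

```python
import numpy as np
import pandas as pd

def categorize_car(cyl, am, hp, mpg):
    if cyl > 4 and am == 1 and hp > 150:
        return 'Sports Car'
    elif mpg > 20 and hp < 100:
        return 'Grocery Getter'
    else:
        return 'Roadster'

# Three rows copied by hand from the first table, plus one hypothetical car
sample_df = pd.DataFrame(
    {
        "cyl": [6, 4, 8, 8],
        "am": [1, 1, 0, 1],
        "hp": [110, 93, 175, 300],
        "mpg": [21.0, 22.8, 18.7, 15.0],
    },
    index=["Mazda RX4", "Datsun 710", "Hornet Sportabout", "Hypothetical V8"],
)

# Row-wise version, as in the cell above
row_wise = sample_df.apply(
    lambda x: categorize_car(x['cyl'], x['am'], x['hp'], x['mpg']), axis=1
)

# Column-wise version: conditions are checked in order, like if / elif / else
conditions = [
    (sample_df["cyl"] > 4) & (sample_df["am"] == 1) & (sample_df["hp"] > 150),
    (sample_df["mpg"] > 20) & (sample_df["hp"] < 100),
]
labels = ["Sports Car", "Grocery Getter"]
vectorized = pd.Series(
    np.select(conditions, labels, default="Roadster"), index=sample_df.index
)

print(vectorized)
print((row_wise == vectorized).all())
```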