# Activity: Perform feature engineering 

## **Introduction**


As you're learning, data professionals working on modeling projects use featuring engineering to help them determine which attributes in the data can best predict certain measures.

In this activity, you are working for a firm that provides insights to the National Basketball Association (NBA), a professional North American basketball league. You will help NBA managers and coaches identify which players are most likely to thrive in the high-pressure environment of professional basketball and help the team be successful over time.

To do this, you will analyze a subset of data that contains information about NBA players and their performance records. You will conduct feature engineering to determine which features will most effectively predict whether a player's NBA career will last at least five years. The insights gained then will be used in the next stage of the project: building the predictive model.


## **Step 1: Imports** 


Start by importing `pandas`.

In [4]:
# Import pandas.

import pandas as pd
import numpy as np


The dataset is a .csv file named `nba-players.csv`. It consists of performance records for a subset of NBA players. As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file, or provide more code, in order to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

In [2]:
# RUN THIS CELL TO IMPORT YOUR DATA.

# Save in a variable named `data`.

### YOUR CODE HERE ###

data = pd.read_csv("nba-players.csv", index_col=0)

<details><summary><h4><strong>Hint 1</strong></h4></summary>

The `read_csv()` function from `pandas` allows you to read in data from a csv file and load it into a DataFrame.
    
</details>

<details><summary><h4><strong>Hint 2</strong></h4></summary>

Call the `read_csv()`, pass in the name of the csv file as a string, followed by `index_col=0` to use the first column from the csv as the index in the DataFrame.
    
</details>

## **Step 2: Data exploration** 

Display the first 10 rows of the data to get a sense of what it entails.

In [3]:
# Display first 10 rows of data.

data.head()


Unnamed: 0,name,gp,min,pts,fgm,fga,fg,3p_made,3pa,3p,...,fta,ft,oreb,dreb,reb,ast,stl,blk,tov,target_5yrs
0,Brandon Ingram,36,27.4,7.4,2.6,7.6,34.7,0.5,2.1,25.0,...,2.3,69.9,0.7,3.4,4.1,1.9,0.4,0.4,1.3,0
1,Andrew Harrison,35,26.9,7.2,2.0,6.7,29.6,0.7,2.8,23.5,...,3.4,76.5,0.5,2.0,2.4,3.7,1.1,0.5,1.6,0
2,JaKarr Sampson,74,15.3,5.2,2.0,4.7,42.2,0.4,1.7,24.4,...,1.3,67.0,0.5,1.7,2.2,1.0,0.5,0.3,1.0,0
3,Malik Sealy,58,11.6,5.7,2.3,5.5,42.6,0.1,0.5,22.6,...,1.3,68.9,1.0,0.9,1.9,0.8,0.6,0.1,1.0,1
4,Matt Geiger,48,11.5,4.5,1.6,3.0,52.4,0.0,0.1,0.0,...,1.9,67.4,1.0,1.5,2.5,0.3,0.3,0.4,0.8,1


<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

There is a function in the `pandas` library that can be called on a DataFrame to display the first n number of rows, where n is a number of your choice. 
</details>

<details>
<summary><h4><strong>Hint 2</strong></h4></summary>

Call the `head()` function and pass in 10.
</details>

Display the number of rows and the number of columns to get a sense of how much data is available to you.

In [6]:
# Display number of rows, number of columns.

data.shape

(1340, 21)

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

DataFrames in `pandas` have an attribute that can be called to get the number of rows and columns as a tuple.
</details>

<details>
<summary><h4><strong>Hint 2</strong></h4></summary>

You can call the `shape` attribute.
</details>

**Question:** What do you observe about the number of rows and the number of columns in the data?

 [Write your response here. Double-click (or enter) to edit.]

Now, display all column names to get a sense of the kinds of metadata available about each player. Use the columns property in pandas.


In [7]:
# Display all column names.

data.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 1340 entries, 0 to 1339
Data columns (total 21 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   name         1340 non-null   object 
 1   gp           1340 non-null   int64  
 2   min          1340 non-null   float64
 3   pts          1340 non-null   float64
 4   fgm          1340 non-null   float64
 5   fga          1340 non-null   float64
 6   fg           1340 non-null   float64
 7   3p_made      1340 non-null   float64
 8   3pa          1340 non-null   float64
 9   3p           1340 non-null   float64
 10  ftm          1340 non-null   float64
 11  fta          1340 non-null   float64
 12  ft           1340 non-null   float64
 13  oreb         1340 non-null   float64
 14  dreb         1340 non-null   float64
 15  reb          1340 non-null   float64
 16  ast          1340 non-null   float64
 17  stl          1340 non-null   float64
 18  blk          1340 non-null   float64
 19  tov   

The following table provides a description of the data in each column. This metadata comes from the data source, which is listed in the references section of this lab.

<center>

|Column Name|Column Description|
|:---|:-------|
|`name`|Name of NBA player|
|`gp`|Number of games played|
|`min`|Number of minutes played per game|
|`pts`|Average number of points per game|
|`fgm`|Average number of field goals made per game|
|`fga`|Average number of field goal attempts per game|
|`fg`|Average percent of field goals made per game|
|`3p_made`|Average number of three-point field goals made per game|
|`3pa`|Average number of three-point field goal attempts per game|
|`3p`|Average percent of three-point field goals made per game|
|`ftm`|Average number of free throws made per game|
|`fta`|Average number of free throw attempts per game|
|`ft`|Average percent of free throws made per game|
|`oreb`|Average number of offensive rebounds per game|
|`dreb`|Average number of defensive rebounds per game|
|`reb`|Average number of rebounds per game|
|`ast`|Average number of assists per game|
|`stl`|Average number of steals per game|
|`blk`|Average number of blocks per game|
|`tov`|Average number of turnovers per game|
|`target_5yrs`|1 if career duration >= 5 yrs, 0 otherwise|

</center>

Next, display a summary of the data to get additional information about the DataFrame, including the types of data in the columns.

In [8]:
# Use .info() to display a summary of the DataFrame.

data.describe()


Unnamed: 0,gp,min,pts,fgm,fga,fg,3p_made,3pa,3p,ftm,fta,ft,oreb,dreb,reb,ast,stl,blk,tov,target_5yrs
count,1340.0,1340.0,1340.0,1340.0,1340.0,1340.0,1340.0,1340.0,1340.0,1340.0,1340.0,1340.0,1340.0,1340.0,1340.0,1340.0,1340.0,1340.0,1340.0,1340.0
mean,60.414179,17.624627,6.801493,2.629104,5.885299,44.169403,0.247612,0.779179,19.149627,1.297687,1.82194,70.300299,1.009403,2.025746,3.034478,1.550522,0.618507,0.368582,1.193582,0.620149
std,17.433992,8.307964,4.357545,1.683555,3.593488,6.137679,0.383688,1.061847,16.051861,0.987246,1.322984,10.578479,0.777119,1.360008,2.057774,1.471169,0.409759,0.429049,0.722541,0.485531
min,11.0,3.1,0.7,0.3,0.8,23.8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.3,0.0,0.0,0.0,0.1,0.0
25%,47.0,10.875,3.7,1.4,3.3,40.2,0.0,0.0,0.0,0.6,0.9,64.7,0.4,1.0,1.5,0.6,0.3,0.1,0.7,0.0
50%,63.0,16.1,5.55,2.1,4.8,44.1,0.1,0.3,22.2,1.0,1.5,71.25,0.8,1.7,2.5,1.1,0.5,0.2,1.0,1.0
75%,77.0,22.9,8.8,3.4,7.5,47.9,0.4,1.2,32.5,1.6,2.3,77.6,1.4,2.6,4.0,2.0,0.8,0.5,1.5,1.0
max,82.0,40.9,28.2,10.2,19.8,73.7,2.3,6.5,100.0,7.7,10.2,100.0,5.3,9.6,13.9,10.6,2.5,3.9,4.4,1.0


**Question:** Based on the preceding tables, which columns are numerical and which columns are categorical?

 [Write your response here. Double-click (or enter) to edit.]

### Check for missing values

Now, review the data to determine whether it contains any missing values. Begin by displaying the number of missing values in each column. After that, use isna() to check whether each value in the data is missing. Finally, use sum() to aggregate the number of missing values per column.


In [13]:
# Display the number of missing values in each column.
# Check whether each value is missing.
#Aggregate the number of missing values per column.

data.isna().sum()
data = data.dropna()

**Question:** What do you observe about the missing values in the columns? 

 [Write your response here. Double-click (or enter) to edit.]

**Question:** Why is it important to check for missing values?

 Write your response here. Double-click (or enter) to edit.

## **Step 3: Statistical tests** 



Next, use a statistical technique to check the class balance in the data. To understand how balanced the dataset is in terms of class, display the percentage of values that belong to each class in the target column. In this context, class 1 indicates an NBA career duration of at least five years, while class 0 indicates an NBA career duration of less than five years.

In [22]:
# Display percentage (%) of values for each class (1, 0) represented in the target column of this dataset.

count_career_greater = data['target_5yrs'].sum()
count_career_less = data.shape[0] - count_career_greater
greater_career_percentage = (count_career_greater/data.shape[0]) * 100
less_career_percentage = (count_career_less/data.shape[0]) * 100

print('count of players with career greater than 5 years ' + str(count_career_greater))
print('count of players with career less than 5 years ' + str(count_career_less))
print('Percentage of players with 5+ years: ' + str(greater_career_percentage))
print('Percentage of players with less than 5 years: ' + str(less_career_percentage))

count of players with career greater than 5 years 831
count of players with career less than 5 years 509
Percentage of players with 5+ years: 62.01492537313433
Percentage of players with less than 5 years: 37.985074626865675


<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

In `pandas`, `value_counts(normalize=True)` can be used to calculate the frequency of each distinct value in a specific column of a DataFrame.  
</details>

<details>
<summary><h4><strong>Hint 2</strong></h4></summary>

After `value_counts(normalize=True)`, multipling by `100` converts the frequencies into percentages (%).
</details>

**Question:** What do you observe about the class balance in the target column?

 [Write your response here. Double-click (or enter) to edit.]

**Question:** Why is it important to check class balance?

Write your response here. Double-click (or enter) to edit.

## **Step 4: Results and evaluation** 


Now, perform feature engineering, with the goal of identifying and creating features that will serve as useful predictors for the target variable, `target_5yrs`. 

### Feature selection

The following table contains descriptions of the data in each column:

<center>

|Column Name|Column Description|
|:---|:-------|
|`name`|Name of NBA player|
|`gp`|Number of games played|
|`min`|Number of minutes played|
|`pts`|Average number of points per game|
|`fgm`|Average number of field goals made per game|
|`fga`|Average number of field goal attempts per game|
|`fg`|Average percent of field goals made per game|
|`3p_made`|Average number of three-point field goals made per game|
|`3pa`|Average number of three-point field goal attempts per game|
|`3p`|Average percent of three-point field goals made per game|
|`ftm`|Average number of free throws made per game|
|`fta`|Average number of free throw attempts per game|
|`ft`|Average percent of free throws made per game|
|`oreb`|Average number of offensive rebounds per game|
|`dreb`|Average number of defensive rebounds per game|
|`reb`|Average number of rebounds per game|
|`ast`|Average number of assists per game|
|`stl`|Average number of steals per game|
|`blk`|Average number of blocks per game|
|`tov`|Average number of turnovers per game|
|`target_5yrs`|1 if career duration >= 5 yrs, 0 otherwise|

</center>

**Question:** Which columns would you select and avoid selecting as features, and why? Keep in mind the goal is to identify features that will serve as useful predictors for the target variable, `target_5yrs`. 

 [Write your response here. Double-click (or enter) to edit.]

Next, select the columns you want to proceed with. Make sure to include the target column, `target_5yrs`. Display the first few rows to confirm they are as expected.

In [26]:
# Select the columns to proceed with and save the DataFrame in new variable `selected_data`.
# Include the target column, `target_5yrs`.

selected_data = data[['gp', 'pts', 'reb', 'ast', 'target_5yrs']]


# Display the first few rows.

selected_data.head()



Unnamed: 0,gp,pts,reb,ast,target_5yrs
0,36,7.4,4.1,1.9,0
1,35,7.2,2.4,3.7,0
2,74,5.2,2.2,1.0,0
3,58,5.7,1.9,0.8,1
4,48,4.5,2.5,0.3,1


<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Refer to the materials about feature selection and selecting a subset of a DataFrame.
</details>

<details>
<summary><h4><strong>Hint 2</strong></h4></summary>

Use two pairs of square brackets, and place the names of the columns you want to select inside the innermost brackets. 

</details>

<details>
<summary><h4><strong>Hint 3</strong></h4></summary>

There is a function in `pandas` that can be used to display the first few rows of a DataFrame. Make sure to specify the column names with spelling that matches what's in the data. Use quotes to represent each column name as a string. 
</details>

### Feature transformation

An important aspect of feature transformation is feature encoding. If there are categorical columns that you would want to use as features, those columns should be transformed to be numerical. This technique is also known as feature encoding.

**Question:** Why is feature transformation important to consider? Are there any transformations necessary for the features you want to use?

 [Write your response here. Double-click (or enter) to edit.]

### Feature extraction

Display the first few rows containing containing descriptions of the data for reference. The table is as follows:

<center>

|Column Name|Column Description|
|:---|:-------|
|`name`|Name of NBA player|
|`gp`|Number of games played|
|`min`|Number of minutes played per game|
|`pts`|Average number of points per game|
|`fgm`|Average number of field goals made per game|
|`fga`|Average number of field goal attempts per game|
|`fg`|Average percent of field goals made per game|
|`3p_made`|Average number of three-point field goals made per game|
|`3pa`|Average number of three-point field goal attempts per game|
|`3p`|Average percent of three-point field goals made per game|
|`ftm`|Average number of free throws made per game|
|`fta`|Average number of free throw attempts per game|
|`ft`|Average percent of free throws made per game|
|`oreb`|Average number of offensive rebounds per game|
|`dreb`|Average number of defensive rebounds per game|
|`reb`|Average number of rebounds per game|
|`ast`|Average number of assists per game|
|`stl`|Average number of steals per game|
|`blk`|Average number of blocks per game|
|`tov`|Average number of turnovers per game|
|`target_5yrs`|1 if career duration >= 5 yrs, 0 otherwise|

</center>

In [27]:
# Display the first few rows of `selected_data` for reference.

selected_data.head()



Unnamed: 0,gp,pts,reb,ast,target_5yrs
0,36,7.4,4.1,1.9,0
1,35,7.2,2.4,3.7,0
2,74,5.2,2.2,1.0,0
3,58,5.7,1.9,0.8,1
4,48,4.5,2.5,0.3,1


**Question:** Which columns lend themselves to feature extraction?

 [Write your response here. Double-click (or enter) to edit.]

Extract two features that you think would help predict `target_5yrs`. Then, create a new variable named 'extracted_data' that contains features from 'selected_data', as well as the features being extracted.

In [28]:
# Extract two features that would help predict target_5yrs.
# Create a new variable named `extracted_data`.

extracted_data = selected_data[['gp', 'pts', 'reb', 'target_5yrs']]





<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Refer to the materials about feature extraction.
</details>

<details>
<summary><h4><strong>Hint 2</strong></h4></summary>

Use the function `copy()` to make a copy of a DataFrame. To access a specific column from a DataFrame, use a pair of square brackets and place the name of the column as a string inside the brackets.

</details>

<details>
<summary><h4><strong>Hint 3</strong></h4></summary>

Use a pair of square brackets to create a new column in a DataFrame. The columns in DataFrames are series objects, which support elementwise operations such as multiplication and division. Be sure the column names referenced in your code match the spelling of what's in the DataFrame.
</details>

Now, to prepare for the Naive Bayes model that you will build in a later lab, clean the extracted data and ensure ensure it is concise. Naive Bayes involves an assumption that features are independent of each other given the class. In order to satisfy that criteria, if certain features are aggregated to yield new features, it may be necessary to remove those original features. Therefore, drop the columns that were used to extract new features.

**Note:** There are other types of models that do not involve independence assumptions, so this would not be required in those instances. In fact, keeping the original features may be beneficial.

In [None]:
# Remove any columns from `extracted_data` that are no longer needed.

### YOUR CODE HERE ###


# Display the first few rows of `extracted_data` to ensure that column drops took place.

### YOUR CODE HERE ###



<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Refer to the materials about feature extraction.
</details>

<details>
<summary><h4><strong>Hint 2</strong></h4></summary>

There are functions in the `pandas` library that remove specific columns from a DataFrame and that display the first few rows of a DataFrame.
</details>

<details>
<summary><h4><strong>Hint 3</strong></h4></summary>

Use the `drop()` function and pass in a list of the names of the columns you want to remove. By default, calling this function will result in a new DataFrame that reflects the changes you made. The original DataFrame is not automatically altered. You can reassign `extracted_data` to the result, in order to update it. 

Use the `head()` function to display the first few rows of a DataFrame.
</details>

Next, export the extracted data as a new .csv file. You will use this in a later lab. 

In [29]:
# Export the extracted data.

extracted_data.to_csv()


',gp,pts,reb,target_5yrs\n0,36,7.4,4.1,0\n1,35,7.2,2.4,0\n2,74,5.2,2.2,0\n3,58,5.7,1.9,1\n4,48,4.5,2.5,1\n5,75,3.7,0.8,0\n6,62,6.6,2.0,1\n7,48,5.7,1.7,1\n8,65,2.4,0.8,0\n9,42,3.7,1.1,0\n10,35,2.3,0.9,0\n11,40,3.6,1.2,1\n12,27,1.3,2.0,1\n13,45,5.6,2.0,0\n14,44,2.4,1.4,1\n15,40,2.6,0.4,1\n16,49,2.1,1.2,0\n17,41,1.7,0.3,0\n18,82,19.2,11.0,0\n19,82,19.2,11.0,1\n20,80,14.3,8.0,1\n21,82,13.3,5.1,1\n22,76,10.6,2.9,1\n23,61,12.0,3.1,0\n24,32,6.3,5.2,0\n25,76,10.4,5.9,1\n26,52,9.3,1.9,0\n27,76,8.8,5.1,1\n28,78,10.1,3.0,0\n29,51,8.4,5.4,1\n30,64,6.2,6.4,1\n31,55,10.4,6.1,1\n32,82,7.4,2.0,1\n33,48,9.1,5.7,1\n34,34,4.3,1.4,0\n35,42,8.5,2.0,1\n36,82,5.5,1.7,1\n37,64,7.0,2.0,1\n38,80,7.4,4.2,1\n39,77,5.6,3.8,1\n40,55,4.1,1.5,0\n41,48,5.1,1.7,0\n42,51,5.0,1.5,0\n43,51,3.9,2.1,0\n44,72,4.6,1.3,1\n45,69,3.6,2.7,1\n46,54,5.7,3.4,1\n47,24,4.1,0.7,0\n48,50,4.1,1.1,1\n49,56,3.4,1.2,1\n50,67,3.7,0.7,0\n51,50,2.5,0.8,1\n52,45,2.7,1.5,1\n53,38,3.4,0.9,0\n54,42,2.8,1.9,0\n55,54,3.2,1.3,0\n56,35,4.0,1.2,0\n57,3

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

There is a function in the `pandas` library that exports a DataFrame as a .csv file. 
</details>

<details>
<summary><h4><strong>Hint 2</strong></h4></summary>

Use the `to_csv()` function to export the DataFrame as a .csv file. 
</details>

<details>
<summary><h4><strong>Hint 3</strong></h4></summary>

Call the `to_csv()` function on `extracted_data`, and pass in the name that you want to give to the resulting .csv file. Specify the file name as a string and in the file name. Make sure to include `.csv` as the file extension. Also, pass in the parameter `index` set to `0`, so that when the export occurs, the row indices from the DataFrame are not treated as an additional column in the resulting file. 
</details>

## **Considerations**


**What are some key takeaways that you learned during this lab? Consider the process you followed and what tasks were performed during each step, as well as important priorities when training data.**

 [Write your response here. Double-click (or enter) to edit.]

**What summary would you provide to stakeholders? Consider key attributes to be shared from the data, as well as upcoming project plans.**

 [Write your response here. Double-click (or enter) to edit.]

**Congratulations!** You've completed this lab. However, you may not notice a green check mark next to this item on Coursera's platform. Please continue your progress regardless of the check mark. Just click on the "save" icon at the top of this notebook to ensure your work has been logged.