# Class 9: DataFrames continued

Plan for today:
- Review of pandas Series
- Continue with pandas DataFrames


## Notes on the class Jupyter setup

If you have the *ydata123_2023e* environment set up correctly, you can get the class code using the code below (which presumably you've already done given that you are seeing this notebook).  

In [2]:
import YData

# YData.download.download_class_code(9)   # get class code    

# YData.download.download_class_code(9, True) # get the code with the answers 

YData.download.download_data("dow.csv")
YData.download.download_data("monthly_egg_prices.csv")
YData.download.download_data("monthly_wheat_prices.csv")
YData.download.download_data("nba_salaries_2015_16.csv")
YData.download.download_data("nba_position_names.csv")


The file `dow.csv` already exists.
If you would like to download a new copy of the file, please rename the existing copy of the file.
The file `monthly_egg_prices.csv` already exists.
If you would like to download a new copy of the file, please rename the existing copy of the file.
The file `monthly_wheat_prices.csv` already exists.
If you would like to download a new copy of the file, please rename the existing copy of the file.
The file `nba_salaries_2015_16.csv` already exists.
If you would like to download a new copy of the file, please rename the existing copy of the file.
The file `nba_position_names.csv` already exists.
If you would like to download a new copy of the file, please rename the existing copy of the file.


There are also similar functions to download the homework:

In [3]:
YData.download.download_homework(4)  # downloads the homework 

If you are using Google Colab, you should install polars and the YData packages by uncommenting and running the code below.

In [None]:
# !pip install https://github.com/emeyers/YData_package/tarball/master

If you are using Google Colab, you should also uncomment and run the code below to mount your Google Drive.

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

In [3]:
import pandas as pd
import statistics
import numpy as np
from datetime import datetime

import matplotlib.pyplot as plt
%matplotlib inline

## Review of pandas Series

A pandas Series is a one-dimensional ndarray with axis labels.

A pandas DataFrame is a table of data (rows and columns).
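
To make the two definitions concrete, here is a minimal sketch that builds a tiny Series and turns it into a one-column DataFrame. The prices are made up for illustration and are not the class data.

```python
import pandas as pd

# Hypothetical prices, made up for illustration (not the class data)
prices = pd.Series(
    [0.88, 0.77, 0.81],
    index=pd.to_datetime(["1980-01-01", "1980-02-01", "1980-03-01"]),
    name="Price",
)

print(prices.shape)   # a Series has one dimension
print(prices.index)   # the axis labels (here, dates)

# A DataFrame is a table; each of its columns is a Series
table = prices.to_frame()
print(table.shape)    # a DataFrame has two dimensions (rows, columns)
```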

Let's look at the egg and wheat price data...


In [4]:
egg_price_series = pd.read_csv("monthly_egg_prices.csv", parse_dates=True, index_col= "DATE").squeeze() 

# print the type
print(type(egg_price_series))

# print the shape
print(egg_price_series.shape)

# print the series
egg_price_series


<class 'pandas.core.series.Series'>
(516,)


DATE
1980-01-01    0.879
1980-02-01    0.774
1980-03-01    0.812
1980-04-01    0.797
1980-05-01    0.737
              ...  
2022-08-01    3.116
2022-09-01    2.902
2022-10-01    3.419
2022-11-01    3.589
2022-12-01    4.250
Name: Price, Length: 516, dtype: float64

## Warm-up problems

**Problem 1:** What was the price of eggs on December 1st, 1999?

In [5]:
egg_price_series.loc["1999-12-01"]

0.92

**Problem 2:**  What is the value of the 50th egg price in our Series of egg prices? 

In [6]:
egg_price_series.iloc[49]

1.324

In [7]:
# What is the average egg price over the whole data set? 
np.mean(egg_price_series)

1.297062015503876

**Problem 3:**  Can you calculate the average egg price since Jan 1st 2000? 

Hints: 
- You can access the index values using `my_series.index`
- Boolean masking could be useful (and here you can treat dates as strings)


In [8]:
boolean_dates_2000 = egg_price_series.index > "1999-12-31"

print(egg_price_series[boolean_dates_2000])

np.mean(egg_price_series[boolean_dates_2000])

DATE
2000-01-01    0.975
2000-02-01    0.962
2000-03-01    0.931
2000-04-01    0.939
2000-05-01    0.852
              ...  
2022-08-01    3.116
2022-09-01    2.902
2022-10-01    3.419
2022-11-01    3.589
2022-12-01    4.250
Name: Price, Length: 276, dtype: float64


1.6217173913043479
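
As an alternative to the Boolean mask, a label-based slice with `.loc` can select the same rows when the index holds sorted dates. The sketch below compares the two approaches on a small made-up Series (the values are hypothetical, not the egg price data).

```python
import pandas as pd

# Hypothetical monthly prices (made up), indexed by date
dates = pd.date_range("1999-10-01", periods=6, freq="MS")
prices = pd.Series([0.90, 0.91, 0.92, 0.98, 0.96, 0.93], index=dates, name="Price")

# Boolean mask: the date string is compared against each index value
mask = prices.index > "1999-12-31"
since_2000_mask = prices[mask]

# Label-based slice: .loc accepts date strings on a sorted DatetimeIndex
since_2000_slice = prices.loc["2000-01-01":]

print(since_2000_mask.equals(since_2000_slice))
print(since_2000_slice.mean())
```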

Recall, we can turn the index back into a column using `my_series.reset_index()`

This turns our Series into a DataFrame!

Let's explore DataFrames more now...

In [9]:
egg_price_df = egg_price_series.reset_index()

egg_price_df.head(3)

Unnamed: 0,DATE,Price
0,1980-01-01,0.879
1,1980-02-01,0.774
2,1980-03-01,0.812


## DataFrames!

The ability to manipulate data in tables (DataFrames) is one of the most useful skills in Data Science. 

Pandas is the most popular package in Python for manipulating data tables so we will use this package for manipulating tables in this class. The syntax for Pandas can be a little tricky, so try to be patient if you run into errors, and as always, there should be plenty of help available at office hours and on Ed. 

As an example, let's look at data on the closing price of the [Dow Jones Industrial Average](https://www.marketwatch.com/investing/index/djia), which is a price-weighted index of 30 large, publicly traded US companies.

The code below loads the DOW data into a Pandas DataFrame and displays the first 5 rows using the `head()` method. 


In [10]:
dow = pd.read_csv("dow.csv", parse_dates=True)  # dates stay as strings: parse_dates=True only parses the index, and Date is not the index yet

dow = dow.set_index("Date")

dow.head()  # we can get the last few rows using .tail()

Unnamed: 0_level_0,Year,Month,Day,Open,High,Low,Close
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1/25/23,2023,1,Wednesday,33538.36,33773.09,33273.21,33743.84
1/24/23,2023,1,Tuesday,33444.72,33782.92,33310.56,33733.96
1/23/23,2023,1,Monday,33439.56,33782.88,33316.25,33629.56
1/20/23,2023,1,Friday,33073.46,33381.95,32948.93,33375.49
1/19/23,2023,1,Thursday,33171.35,33227.49,32982.05,33044.56
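
The `Date` labels above are plain strings such as "1/25/23". If real dates are wanted, one possible approach is to convert the column explicitly with `pd.to_datetime`. This is a sketch on made-up rows, assuming the file uses a month/day/two-digit-year layout as the output above suggests.

```python
import pandas as pd

# Hypothetical rows in the same "month/day/two-digit-year" style (values made up)
df = pd.DataFrame({
    "Date": ["1/25/23", "1/24/23", "10/2/80"],
    "Close": [100.0, 101.0, 102.0],
})

# Convert the strings explicitly; %y is a two-digit year
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%y")
df = df.set_index("Date").sort_index()

print(df.index)
print(df.loc["2023-01"])   # with a date index, we can select a whole month
```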


In [11]:
# get the shape, and dtypes

print(dow.shape)
print(dow.dtypes)


(10668, 7)
Year       int64
Month      int64
Day       object
Open     float64
High     float64
Low      float64
Close    float64
dtype: object


In [12]:
# get descriptive statistics on DataFrame using the .describe() method

dow.describe().astype("int")

Unnamed: 0,Year,Month,Open,High,Low,Close
count,10668,10668,10668,10668,10668,10668
mean,2001,6,10482,10563,10399,10484
std,12,3,8751,8797,8703,8752
min,1980,1,776,783,769,776
25%,1991,4,2920,2946,2892,2920
50%,2001,7,9763,9855,9661,9762
75%,2012,10,13347,13464,13255,13351
max,2023,12,36722,36952,36636,36799


### Selecting columns from a DataFrame

We can select columns from a DataFrame using the square brackets; e.g., `my_df["my_col"]`

If we'd like to select multiple columns we can pass a list; e.g., `my_df[["col1", "col2"]]`
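
The kind of brackets determines what comes back: a single column name gives a Series, while a list of names gives a DataFrame. A minimal sketch on a made-up table:

```python
import pandas as pd

# Hypothetical table (values made up for illustration)
df = pd.DataFrame({"Open": [10.0, 11.0], "Close": [10.5, 10.8], "Day": ["Mon", "Tue"]})

one_col = df["Close"]             # a string selects a Series
two_cols = df[["Open", "Close"]]  # a list of strings selects a DataFrame
still_df = df[["Close"]]          # a one-element list is still a DataFrame

print(type(one_col).__name__, one_col.dtype)
print(type(two_cols).__name__)
print(two_cols.dtypes)
print(type(still_df).__name__)
```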


In [13]:
# Get just the DOW close price

close_price = dow["Close"]

close_price.head()  # what is the type of close_price? (use type() and .dtype)


Date
1/25/23    33743.84
1/24/23    33733.96
1/23/23    33629.56
1/20/23    33375.49
1/19/23    33044.56
Name: Close, dtype: float64

In [14]:
# Get both the open and close price
open_close_price = dow[["Open", "Close"]]

open_close_price # what is the type of open_close_price? (use type() and .dtypes)

Unnamed: 0_level_0,Open,Close
Date,Unnamed: 1_level_1,Unnamed: 2_level_1
1/25/23,33538.36,33743.84
1/24/23,33444.72,33733.96
1/23/23,33439.56,33629.56
1/20/23,33073.46,33375.49
1/19/23,33171.35,33044.56
...,...,...
10/8/80,963.98,963.98
10/7/80,960.67,960.67
10/6/80,965.70,965.70
10/3/80,950.69,950.69


### Getting a subset of rows from a DataFrame

Similar to pandas Series, we can get particular rows from a DataFrame using:

- `.loc`: Get rows by Index values (labels) or by Boolean masks
- `.iloc`: Get rows by their integer position (row number)
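
Here is a small sketch contrasting the two accessors on a made-up table whose index holds date strings (the numbers are hypothetical).

```python
import pandas as pd

# Hypothetical table with string labels as the index (values made up)
df = pd.DataFrame(
    {"Year": [2023, 2023, 2022], "Close": [33.7, 33.6, 33.1]},
    index=["1/25/23", "1/24/23", "12/30/22"],
)

by_label = df.loc["1/24/23"]            # the row whose index label is "1/24/23"
by_position = df.iloc[1]                # the second row, counting from 0
by_mask = df.loc[df["Year"] == 2023]    # every row where the mask is True

print(by_label)
print(by_mask)
```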



In [15]:
# Extract a row based on the Index name "1/25/23"
dow.loc["1/25/23"]

Year          2023
Month            1
Day      Wednesday
Open      33538.36
High      33773.09
Low       33273.21
Close     33743.84
Name: 1/25/23, dtype: object

In [16]:
# Extract a row based on the row number (get row 0)
dow.iloc[0]

Year          2023
Month            1
Day      Wednesday
Open      33538.36
High      33773.09
Low       33273.21
Close     33743.84
Name: 1/25/23, dtype: object

In [17]:
# We can get multiple rows that meet particular conditions using Boolean masking

booleans_in_2022 = dow["Year"] == 2022

booleans_in_2022

Date
1/25/23    False
1/24/23    False
1/23/23    False
1/20/23    False
1/19/23    False
           ...  
10/8/80    False
10/7/80    False
10/6/80    False
10/3/80    False
10/2/80    False
Name: Year, Length: 10668, dtype: bool

In [18]:
# extract the 2022 values using our Boolean mask
dow.loc[booleans_in_2022]   # actually works even without the .loc

Unnamed: 0_level_0,Year,Month,Day,Open,High,Low,Close
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
12/30/22,2022,12,Friday,33121.61,33152.55,32847.82,33147.25
12/29/22,2022,12,Thursday,33021.43,33293.42,33020.35,33220.80
12/28/22,2022,12,Wednesday,33264.76,33379.55,32869.15,32875.71
12/27/22,2022,12,Tuesday,33224.23,33387.72,33069.58,33241.56
12/23/22,2022,12,Friday,32961.06,33226.14,32814.02,33203.93
...,...,...,...,...,...,...,...
1/7/22,2022,1,Friday,36249.59,36382.84,36111.53,36231.66
1/6/22,2022,1,Thursday,36409.05,36464.19,36200.68,36236.47
1/5/22,2022,1,Wednesday,36722.60,36952.65,36400.39,36407.11
1/4/22,2022,1,Tuesday,36636.00,36934.84,36636.00,36799.65


In [19]:
# Can you get the mean DOW close value in 2022? 

data_2022 = dow[dow.Year == 2022]

print(data_2022["Close"].mean())   # using the Series mean() function

np.mean(data_2022["Close"])  # can also use np.mean()



32897.345179282864


32897.345179282864

### Sorting values in a DataFrame

We can sort values in a DataFrame using `.sort_values("col_name")`

We can sort from highest to lowest by setting the argument `ascending = False`
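
A minimal sketch of both sort directions on a made-up table. Note that `sort_values()` returns a new DataFrame and leaves the original order untouched.

```python
import pandas as pd

# Hypothetical table (values made up for illustration)
df = pd.DataFrame({"Day": ["Mon", "Tue", "Wed"], "Close": [10.5, 9.8, 11.2]})

low_to_high = df.sort_values("Close")
high_to_low = df.sort_values("Close", ascending=False)

print(high_to_low.head(1))   # the row with the largest Close
print(df)                    # the original DataFrame is unchanged
```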


In [20]:
# Sort the data by the Close value
dow.sort_values("Close").head()

Unnamed: 0_level_0,Year,Month,Day,Open,High,Low,Close
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
8/12/82,1982,8,Thursday,776.91,786.14,773.21,776.91
8/11/82,1982,8,Wednesday,777.2,783.96,772.16,777.2
8/10/82,1982,8,Tuesday,779.29,789.09,775.68,779.29
8/9/82,1982,8,Monday,780.34,784.33,769.97,780.34
8/6/82,1982,8,Friday,784.33,798.98,781.76,784.33


In [21]:
# What is the highest the DOW has been? 
dow.sort_values("Close", ascending = False).head()

Unnamed: 0_level_0,Year,Month,Day,Open,High,Low,Close
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1/4/22,2022,1,Tuesday,36636.0,36934.84,36636.0,36799.65
1/3/22,2022,1,Monday,36321.59,36595.82,36246.45,36585.06
12/29/21,2021,12,Wednesday,36421.14,36571.55,36396.19,36488.63
11/8/21,2021,11,Monday,36416.46,36565.73,36334.42,36432.22
1/5/22,2022,1,Wednesday,36722.6,36952.65,36400.39,36407.11


### Adding new columns to a DataFrame

We can add a column to a DataFrame using square brackets. For example: 

- `my_df["new col"] = my_df["col1"] + my_df["col2"]`.




Percent change is defined as: $100 \times \frac{\text{final} - \text{initial}}{\text{initial}}$
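
As a quick worked example of the formula, the snippet below plugs in the Open and Close values for 1/25/23 from the `dow.head()` output shown earlier, treating Open as the initial value and Close as the final value.

```python
open_price = 33538.36    # Open on 1/25/23 (initial value)
close_price = 33743.84   # Close on 1/25/23 (final value)

# 100 * (final - initial) / initial
percent_change = 100 * (close_price - open_price) / open_price
print(round(percent_change, 6))
```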

Can you add a "Percent change" column to the `dow2` data (which is a copy of the `dow` data) comparing closing and opening prices? What is the biggest percent change in the DOW? 

In [22]:
# copy the data to dow2
dow2 = dow.copy()

# add percent change column
dow2["Percent change"] = 100 * (dow2["Close"] - dow2["Open"])/dow2["Open"]

# sort the data
dow2.sort_values("Percent change").head()

Unnamed: 0_level_0,Year,Month,Day,Open,High,Low,Close,Percent change
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
10/9/08,2008,10,Thursday,9381.96,9522.77,8523.27,8579.19,-8.556528
9/17/01,2001,9,Monday,9580.32,9294.55,8755.46,8920.7,-6.885156
10/15/08,2008,10,Wednesday,9145.24,9278.25,8516.5,8577.91,-6.203555
10/7/08,2008,10,Tuesday,10030.69,10205.04,9391.67,9447.11,-5.817945
4/14/00,2000,4,Friday,10922.85,10890.9,10172.67,10305.77,-5.649441


In [23]:
# sort the data from largest to smallest
dow2.sort_values("Percent change", ascending = False).head() 

# This is actually not historically correct for older dates. 
# See if you can figure out how to calculate the actual largest percent changes. 

Unnamed: 0_level_0,Year,Month,Day,Open,High,Low,Close,Percent change
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
10/28/08,2008,10,Tuesday,8401.65,9112.51,8153.79,9065.12,7.896901
3/23/09,2009,3,Monday,7279.25,7780.72,7278.78,7775.86,6.822269
7/24/02,2002,7,Wednesday,7698.46,8243.07,7489.54,8191.29,6.40167
11/13/08,2008,11,Thursday,8321.21,8898.41,7947.74,8835.25,6.177467
10/13/08,2008,10,Monday,8871.97,9501.91,8638.6,9387.61,5.812012
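
One possible way to approach the hint in the comment above is to compare each day's close to the previous trading day's close instead of to the same day's open. The sketch below is only one reading of the exercise: it uses made-up values and assumes the index has already been converted to real dates so that it can be sorted chronologically.

```python
import pandas as pd

# Hypothetical closes, listed newest first (values made up for illustration)
df = pd.DataFrame(
    {"Close": [110.0, 100.0, 125.0, 100.0]},
    index=pd.to_datetime(["2023-01-05", "2023-01-04", "2023-01-03", "2023-01-02"]),
)

# Put the rows in chronological order so each row follows the previous day
chronological = df.sort_index()

# pct_change() computes (current - previous) / previous as a fraction
chronological["Day-over-day change"] = 100 * chronological["Close"].pct_change()

print(chronological.sort_values("Day-over-day change", ascending=False).head())
```

The first row has no previous day, so its change is missing (`NaN`).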


We can rename columns by:
1. Creating a `rename_dictionary` dictionary that maps old column names to new column names
2. By passing this dictionary to the `my_df.rename(columns = rename_dictionary)` method
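
One detail worth keeping in mind is that `rename()` returns a new DataFrame rather than changing the original, so the result needs to be assigned to a name if we want to keep it. A minimal sketch with a made-up table and a hypothetical new column name:

```python
import pandas as pd

# Hypothetical table (values made up for illustration)
df = pd.DataFrame({"Percent change": [0.6, -0.4], "Close": [33.7, 33.0]})

rename_dictionary = {"Percent change": "Pct"}   # old name -> new name
renamed = df.rename(columns=rename_dictionary)

print(list(renamed.columns))
print(list(df.columns))   # the original column names are still here
```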

In [24]:
# Rename the Percent change column
rename_dictionary = {"Percent change": "Woot"}
dow2.rename(columns = rename_dictionary)

Unnamed: 0_level_0,Year,Month,Day,Open,High,Low,Close,Woot
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1/25/23,2023,1,Wednesday,33538.36,33773.09,33273.21,33743.84,0.612672
1/24/23,2023,1,Tuesday,33444.72,33782.92,33310.56,33733.96,0.864830
1/23/23,2023,1,Monday,33439.56,33782.88,33316.25,33629.56,0.568189
1/20/23,2023,1,Friday,33073.46,33381.95,32948.93,33375.49,0.913210
1/19/23,2023,1,Thursday,33171.35,33227.49,32982.05,33044.56,-0.382227
...,...,...,...,...,...,...,...,...
10/8/80,1980,10,Wednesday,963.98,971.42,955.30,963.98,0.000000
10/7/80,1980,10,Tuesday,960.67,973.05,955.55,960.67,0.000000
10/6/80,1980,10,Monday,965.70,969.63,950.84,965.70,0.000000
10/3/80,1980,10,Friday,950.69,957.84,938.41,950.69,0.000000


### Getting aggregate statistics by group

We can get aggregate statistics by group using the `groupby()` and `agg()` methods with the following syntax:

`my_df.groupby("col_name").agg("agg_function_name")`
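
To illustrate the syntax, here is a minimal sketch on a small made-up table: rows are grouped by `Year`, and one statistic is computed per group.

```python
import pandas as pd

# Hypothetical table (values made up for illustration)
df = pd.DataFrame({
    "Year": [2021, 2021, 2022, 2022, 2022],
    "Close": [10.0, 12.0, 9.0, 15.0, 11.0],
})

# One row per year, holding the largest value in each remaining column
yearly_max = df.groupby("Year").agg("max")
print(yearly_max)

# Selecting a column after grouping gives a Series instead of a DataFrame
yearly_mean = df.groupby("Year")["Close"].agg("mean")
print(yearly_mean)
```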

Can you get the max values of the DOW each year? 


In [25]:
# What were the max values of the DOW each year? 

dow[["Year", "Close"]].groupby("Year").agg("max")


Unnamed: 0_level_0,Close
Year,Unnamed: 1_level_1
1980,1000.16
1981,1024.04
1982,1070.55
1983,1287.19
1984,1286.64
1985,1553.1
1986,1955.57
1987,2722.42
1988,2183.5
1989,2791.41


There are several ways to get multiple statistics by group. Perhaps the most useful way is to use the syntax:

<pre>
my_df.groupby("group_col_name").agg(
   new_col1 = ('col_name', 'statistic_name1'),
   new_col2 = ('col_name', 'statistic_name2'),
   new_col3 = ('col_name', 'statistic_name3')
)
</pre>


Let's create a DataFrame that has the number of trading days, the minimum and the maximum DOW value for each year. 


In [26]:
dow.groupby('Year').agg(
    countClose = ('Close', 'count'),
    minClose = ('Close', 'min'),
    maxClose=('Close', 'max')
)

Unnamed: 0_level_0,countClose,minClose,maxClose
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1980,62,908.44,1000.16
1981,253,824.0,1024.04
1982,253,776.91,1070.55
1983,253,1027.04,1287.19
1984,253,1086.57,1286.64
1985,252,1184.95,1553.1
1986,253,1502.29,1955.57
1987,253,1738.74,2722.42
1988,253,1879.14,2183.5
1989,252,2144.64,2791.41


### "Joining" DataFrames by Index

To explore joining DataFrames, let's load the egg and wheat prices as DataFrames. 

We will also:
- Rename the Price columns to Egg Price and Wheat Price
- Set the Index to be the date


When two DataFrames have the same Index values, we can use the `.join()` method to join them.
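
The `how` argument controls which index values survive the join. The sketch below runs all four options on two tiny made-up DataFrames (the labels and prices are hypothetical) so the differences are easy to see.

```python
import pandas as pd

# Hypothetical tables with partially overlapping index labels
eggs = pd.DataFrame({"Egg Price": [1.0, 1.1, 1.2]}, index=["jan", "feb", "mar"])
wheat = pd.DataFrame({"Wheat Price": [160.0, 150.0]}, index=["feb", "apr"])

left = eggs.join(wheat, how="left")     # keep every row of eggs
right = eggs.join(wheat, how="right")   # keep every row of wheat
inner = eggs.join(wheat, how="inner")   # keep only labels found in both
outer = eggs.join(wheat, how="outer")   # keep labels found in either

for name, result in [("left", left), ("right", right), ("inner", inner), ("outer", outer)]:
    print(name, list(result.index))
```

Where a label is missing from one side, the joined column is filled with `NaN`.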

In [76]:
# load the egg and wheat prices as DataFrames
egg_price_df = pd.read_csv("monthly_egg_prices.csv", parse_dates=True, index_col= "DATE")
egg_price_df = egg_price_df.rename(columns = {"Price":"Egg Price"})
egg_price_df.head(3)

Unnamed: 0_level_0,Egg Price
DATE,Unnamed: 1_level_1
1980-01-01,0.879
1980-02-01,0.774
1980-03-01,0.812


In [77]:
wheat_price_df = pd.read_csv("monthly_wheat_prices.csv", parse_dates=True, index_col= "DATE")
wheat_price_df = wheat_price_df.rename(columns = {"Price":"Wheat Price"})
wheat_price_df.head(3)

Unnamed: 0_level_0,Wheat Price
DATE,Unnamed: 1_level_1
1990-01-01,167.918579
1990-02-01,160.937271
1990-03-01,156.52803


In [78]:
# Let's do a left join by setting how = "left"
# This gives the same result as an outer join because egg_price_df contains every index value in wheat_price_df (and more)
left_joined = egg_price_df.join(wheat_price_df, how = "left") 
left_joined

Unnamed: 0_level_0,Egg Price,Wheat Price
DATE,Unnamed: 1_level_1,Unnamed: 2_level_1
1980-01-01,0.879,
1980-02-01,0.774,
1980-03-01,0.812,
1980-04-01,0.797,
1980-05-01,0.737,
...,...,...
2022-08-01,3.116,323.016769
2022-09-01,2.902,346.322181
2022-10-01,3.419,353.712907
2022-11-01,3.589,344.329861


In [79]:
# Let's do a right join by setting how = "right"  
# This gives the same result as an inner join because egg_price_df contains every index value in wheat_price_df (and more)
right_joined = egg_price_df.join(wheat_price_df, how = "right") 
right_joined

Unnamed: 0_level_0,Egg Price,Wheat Price
DATE,Unnamed: 1_level_1,Unnamed: 2_level_1
1990-01-01,1.223,167.918579
1990-02-01,1.041,160.937271
1990-03-01,1.111,156.528030
1990-04-01,1.092,159.467529
1990-05-01,0.940,149.179291
...,...,...
2022-07-01,2.936,321.975128
2022-08-01,3.116,323.016769
2022-09-01,2.902,346.322181
2022-10-01,3.419,353.712907


### "Merging" DataFrames by column values

If we want to join by value in a column rather than by Index value we can use the `.merge()` method (which is very similar to the `.join()` method). 
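
As a minimal sketch of merging on a column, the example below uses two made-up tables that share a `DATE` column. Passing `on=` names the key explicitly; if it is left out, `merge()` uses the columns the two DataFrames have in common.

```python
import pandas as pd

# Hypothetical tables sharing a DATE column (values made up)
eggs = pd.DataFrame({"DATE": ["1990-01", "1990-02"], "Egg Price": [1.2, 1.0]})
wheat = pd.DataFrame({"DATE": ["1990-02", "1990-03"], "Wheat Price": [160.9, 156.5]})

# Keep every row of eggs and attach wheat prices where the DATE values match
merged = eggs.merge(wheat, how="left", on="DATE")
print(merged)
```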


In [80]:
egg_price_df2 = egg_price_df.reset_index()
egg_price_df2.head(3)

Unnamed: 0,DATE,Egg Price
0,1980-01-01,0.879
1,1980-02-01,0.774
2,1980-03-01,0.812


In [81]:
wheat_price_df2 = wheat_price_df.reset_index()

wheat_price_df2.head(3)

Unnamed: 0,DATE,Wheat Price
0,1990-01-01,167.918579
1,1990-02-01,160.937271
2,1990-03-01,156.52803


In [82]:
left_joined2 = egg_price_df2.merge(wheat_price_df2, how = "left") 
left_joined2

Unnamed: 0,DATE,Egg Price,Wheat Price
0,1980-01-01,0.879,
1,1980-02-01,0.774,
2,1980-03-01,0.812,
3,1980-04-01,0.797,
4,1980-05-01,0.737,
...,...,...,...
511,2022-08-01,3.116,323.016769
512,2022-09-01,2.902,346.322181
513,2022-10-01,3.419,353.712907
514,2022-11-01,3.589,344.329861


#### Merging with different column names

What if the columns we want to join on have different names? We can use the `left_on` and `right_on` arguments to specify which columns (i.e., keys) should be used to align the two DataFrames.

In [83]:
egg_price_df3 = egg_price_df2.rename(columns = {"DATE":"Egg DATE"})
wheat_price_df3 = wheat_price_df2.rename(columns = {"DATE": "Wheat DATE"})

wheat_price_df3.head(3)


Unnamed: 0,Wheat DATE,Wheat Price
0,1990-01-01,167.918579
1,1990-02-01,160.937271
2,1990-03-01,156.52803


In [84]:
egg_price_df3.head(3)

Unnamed: 0,Egg DATE,Egg Price
0,1980-01-01,0.879
1,1980-02-01,0.774
2,1980-03-01,0.812


In [86]:
left_joined3 = egg_price_df3.merge(wheat_price_df3, how = "left", left_on = "Egg DATE", right_on = "Wheat DATE") 
left_joined3

Unnamed: 0,Egg DATE,Egg Price,Wheat DATE,Wheat Price
0,1980-01-01,0.879,NaT,
1,1980-02-01,0.774,NaT,
2,1980-03-01,0.812,NaT,
3,1980-04-01,0.797,NaT,
4,1980-05-01,0.737,NaT,
...,...,...,...,...
511,2022-08-01,3.116,2022-08-01,323.016769
512,2022-09-01,2.902,2022-09-01,346.322181
513,2022-10-01,3.419,2022-10-01,353.712907
514,2022-11-01,3.589,2022-11-01,344.329861


#### Example: Spelling out NBA position names

As you will recall, our NBA salaries DataFrame had the different positions listed as abbreviations such as "C" and "PG". 

Often it is hard to tell what these abbreviations (or codes) mean, so a common use of joining is to attach a lookup table of longer, more descriptive names to the abbreviations. 

Below we load our `nba_salaries` DataFrame along with a `nba_positions` DataFrame which has information about how each position abbreviation maps on to the position's full name.

Let's merge these DataFrames together so that our `nba_salaries` DataFrame has the full position names!



In [87]:
nba_salaries = pd.read_csv("nba_salaries_2015_16.csv")

nba_salaries.head(3)


Unnamed: 0,PLAYER,POSITION,TEAM,SALARY
0,Paul Millsap,PF,Atlanta Hawks,18.671659
1,Al Horford,C,Atlanta Hawks,12.0
2,Tiago Splitter,C,Atlanta Hawks,9.75625


In [88]:
nba_positions = pd.read_csv("nba_position_names.csv")
nba_positions

Unnamed: 0,Position Abbreviation,Position Name
0,PF,Point Guard
1,SG,Shooting Guard
2,C,Center
3,SF,Small Forward
4,PF,Power Forward


In [91]:
# merge the DataFrames together so each player's position is the full position name

nba_improved = nba_salaries.merge(nba_positions, left_on = "POSITION", right_on = "Position Abbreviation")

nba_improved.head(5)

Unnamed: 0,PLAYER,POSITION,TEAM,SALARY,Position Abbreviation,Position Name
0,Paul Millsap,PF,Atlanta Hawks,18.671659,PF,Point Guard
1,Paul Millsap,PF,Atlanta Hawks,18.671659,PF,Power Forward
2,Mike Scott,PF,Atlanta Hawks,3.333333,PF,Point Guard
3,Mike Scott,PF,Atlanta Hawks,3.333333,PF,Power Forward
4,Jonas Jerebko,PF,Boston Celtics,5.0,PF,Point Guard
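
Notice that each "PF" player appears twice in the merged output. This happens because "PF" occurs twice in the `Position Abbreviation` column of `nba_positions` (rows 0 and 4), so every PF row matches both names; the first of those rows was presumably meant to be "PG", although that is an assumption about the data file. Also, because `merge()` defaults to an inner join, players whose abbreviation is missing from the lookup table are dropped. The sketch below, on made-up miniature tables, shows one way to spot duplicate keys and to make `merge()` refuse them.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables (values made up)
players = pd.DataFrame({"PLAYER": ["A", "B"], "POSITION": ["PF", "C"]})
positions = pd.DataFrame({
    "Position Abbreviation": ["PF", "C", "PF"],
    "Position Name": ["Point Guard", "Center", "Power Forward"],
})

# Rows whose key appears more than once in the lookup table
dupes = positions[positions["Position Abbreviation"].duplicated(keep=False)]
print(dupes)

# validate="many_to_one" makes merge raise an error if the right-hand keys repeat
try:
    players.merge(positions, left_on="POSITION",
                  right_on="Position Abbreviation", validate="many_to_one")
    merge_ok = True
except pd.errors.MergeError as err:
    merge_ok = False
    print("merge refused:", err)
```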


In [94]:
# remove unnecessary columns using the .drop(columns = ) method
nba_improved.drop(columns = ["POSITION", "Position Abbreviation"])

Unnamed: 0,PLAYER,TEAM,SALARY,Position Name
0,Paul Millsap,Atlanta Hawks,18.671659,Point Guard
1,Paul Millsap,Atlanta Hawks,18.671659,Power Forward
2,Mike Scott,Atlanta Hawks,3.333333,Point Guard
3,Mike Scott,Atlanta Hawks,3.333333,Power Forward
4,Jonas Jerebko,Boston Celtics,5.000000,Point Guard
...,...,...,...,...
412,Chris Johnson,Utah Jazz,0.981348,Small Forward
413,Martell Webster,Washington Wizards,5.613500,Small Forward
414,Otto Porter Jr.,Washington Wizards,4.662960,Small Forward
415,Jared Dudley,Washington Wizards,4.375000,Small Forward


![pandas](https://image.goat.com/transform/v1/attachments/product_template_additional_pictures/images/071/445/310/original/719082_01.jpg.jpeg)