# Week 3: Data Manipulation with Pandas
### Introduction to Pandas
Pandas is a Python **library** for working with data in a tabular (table-like) structure called a **DataFrame**. A DataFrame is similar to an Excel spreadsheet: it is organized into **columns** and **rows**.
#### Key features in Pandas:
- **DataFrames**: store 2D data (rows and columns)
- **Series**: store 1D arrays
- **Data Manipulation**: you can easily add, remove, modify, and access data
- **Descriptive statistics**: you can easily calculate statistics such as mean, sums, counts, and more
---

### Importing a library
To access the objects in a library, you must first import it into your notebook using the `import` statement.
<br><br>
**Note**: you need to write the library's name every time you use one of its objects, so you can use `import library as abbreviation` to give the library a shorter alias.

In [107]:
# Import the library 'Pandas' and use the alias 'pd'
import pandas as pd # Now you can use 'pd' to refer to 'pandas'

### Working with DataFrames
---
#### *Creating a DataFrame*
To create a **Pandas DataFrame** you can use the class ```pandas.DataFrame(data=None, index=None, columns=None)``` from the pandas library
<br><br>
1. Create DataFrames using Python **lists**

In [None]:
# Let's see what happens when we use lists
my_list = ['one','two','three','four']
df1 = pd.DataFrame(my_list)
df1 # Each element in the list becomes a row in the DataFrame

In [None]:
# Now, what would happen if we use a list of lists?
list_of_lists = [['one','two','three'],['a', 'b', 'c'],[1,2,3]]
df2 = pd.DataFrame(list_of_lists)
df2 # Each list becomes a row in a DataFrame

The top row shows your **column** names. If they are not specified, pandas uses integers (0, 1, 2, ...).
<br>
The leftmost column shows your **index** (row) names. If they are not specified, pandas also uses integers (0, 1, 2, ...).

In [None]:
# Use the attributes .columns and .index to access the column and row labels
print(df2.columns)
print(df2.index)

In [None]:
# We can define the column and index names by:
# (1) assigning new labels to the DataFrame's 'columns' and 'index' attributes
column_labels = ['col1','col2','col3']
index_labels = ['row1','row2','row3']
df2.columns = column_labels
df2.index = index_labels
print(df2)

# (2) passing the column and index labels as arguments
df3 = pd.DataFrame(list_of_lists, columns=column_labels, index=index_labels)
print(df3)

In [None]:
# Let's see our new indexes and columns
print(df2.columns)
print(df2.index)

In [None]:
# If the lists have different lengths, the DataFrame will be created with NaN or None values
pd.DataFrame([['one','two','three'],['a', 'b', 'c'],[1,2]]) 

In [None]:
# If the number of column or index labels does not match the shape of the data, an error will be raised
pd.DataFrame(list_of_lists, columns=column_labels, index=['row1','row2']) # This will raise a ValueError (3 rows of data but only 2 index labels)

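If you want to see the error message without stopping your notebook, one option is to wrap the call in `try`/`except`. This is only a sketch of that idea: it rebuilds the same lists used above, and the variable name `error_message` is just an illustrative choice.

```python
import pandas as pd

list_of_lists = [['one', 'two', 'three'], ['a', 'b', 'c'], [1, 2, 3]]
column_labels = ['col1', 'col2', 'col3']

try:
    # 3 rows of data, but only 2 index labels are provided
    pd.DataFrame(list_of_lists, columns=column_labels, index=['row1', 'row2'])
    error_message = None
except ValueError as err:
    # pandas reports the mismatch between the data and the labels
    error_message = str(err)

print('pandas raised a ValueError:', error_message)
```
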
2. Create DataFrames with Python **dictionaries**

In [None]:
# Create a dictionary with lists
dict1 = {
    'names' : ['Alice', 'Bob', 'Charlie'],
    'ages' : [25, 30, 35],
    'nationalities' : ['American', 'British', 'Australian'],
    'sports' : ['Tennis', 'Soccer', 'Basketball']
}

df4 = pd.DataFrame(dict1)
df4 # The keys of the dictionary become the column labels & each list becomes the values of that column

In [None]:
# If using dictionaries, the length of the lists must be the same or an error will occur
dict2 = {
    'names' : ['Alice', 'Bob', 'Charlie'],
    'ages' : [25, 30, 35],
    'nationalities' : ['American', 'British', 'Australian'],
    'sports' : ['Tennis', 'Soccer']
}
pd.DataFrame(dict2) # This will raise an error

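If your lists really do have different lengths, one possible workaround is to turn each list into a `pd.Series` first, so that pandas fills the missing cells with `NaN` instead of raising an error. This is a sketch of one approach, not the only one; the name `padded` is just illustrative.

```python
import pandas as pd

dict2 = {
    'names': ['Alice', 'Bob', 'Charlie'],
    'ages': [25, 30, 35],
    'nationalities': ['American', 'British', 'Australian'],
    'sports': ['Tennis', 'Soccer'],  # one value short
}

# Each list becomes a Series; pandas aligns them by index and
# fills the positions that have no value with NaN
padded = pd.DataFrame({key: pd.Series(values) for key, values in dict2.items()})
print(padded)
```
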
Now, let's create a DataFrame to learn how to extract and manipulate data

In [None]:
# Let's create a DataFrame using a dictionary
player_data = {
    "Name": ["LeBron James", "Stephen Curry", "Kevin Durant"],
    "Team": ["Lakers", "Warriors", "Suns"],
    "Points Per Game": [27.2, 24.6, 26.9],
}
df = pd.DataFrame(player_data)
print(df)

---
#### *Accessing Data from a DataFrame*
1. Access specific columns using `[]`
2. Access specific rows using `.iloc[]` (by position) or `.loc[]` (by label)

In [None]:
# Using [] after your DataFrame selects a column
print(df['Name'], '\n')

# Using .iloc[] you can access a row by its integer position
print(df.iloc[0], '\n')

# .loc[] is used to access a row by its label
# Let's use df2, whose rows have labels
print(df2)
df2.loc['row1']

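Both `.iloc[]` and `.loc[]` can also return several rows at once. The sketch below rebuilds `df2` from earlier so it can run on its own; the other variable names are only illustrative. Notice that a position slice leaves out its end, while a label slice includes it.

```python
import pandas as pd

df2 = pd.DataFrame(
    [['one', 'two', 'three'], ['a', 'b', 'c'], [1, 2, 3]],
    columns=['col1', 'col2', 'col3'],
    index=['row1', 'row2', 'row3'],
)

# Position slice: rows at positions 0 and 1 (position 2 is excluded)
first_two_by_position = df2.iloc[0:2]

# Label slice: from 'row1' up to and including 'row2'
first_two_by_label = df2.loc['row1':'row2']

# A list of labels picks exactly those rows
picked = df2.loc[['row1', 'row3']]

print(first_two_by_position)
print(picked)
```
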
3. Access a specific cell using `DataFrame[Column][Index]`

In [None]:
# First specify the column, then specify the row index
print(df['Name'][0])


# It will also work with row labels
print(df2['col1']['row1'])

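The `DataFrame[Column][Index]` form works for reading a value, but pandas also lets you pass the row and the column together to `.loc[]`, `.iloc[]`, or `.at[]`. Chained brackets can behave unexpectedly when you assign values, so the single-call form is often preferred. A small sketch using the same player data (the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["LeBron James", "Stephen Curry", "Kevin Durant"],
    "Team": ["Lakers", "Warriors", "Suns"],
    "Points Per Game": [27.2, 24.6, 26.9],
})

# Row label first, then column label
name_by_label = df.loc[0, 'Name']

# Row position first, then column position
name_by_position = df.iloc[0, 0]

# .at[] reads a single cell by row and column label
team_of_second_player = df.at[1, 'Team']

print(name_by_label, name_by_position, team_of_second_player)
```
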

4. Access a subset of a DataFrame using conditions
You can use a condition to get a subset of your DataFrame using `df[condition]`

In [None]:
# Example: let's get a subset of our DataFrame with the players that average more than 25 points per game
df_subset = df[df['Points Per Game'] > 25]
print(df_subset)

# Example: get the players that are from the Bucks team
# In Python, == tests whether two values are equal
df_bucks = df[df['Team'] == 'Bucks']
print('\n',df_bucks) # Empty for now: no Bucks player has been added yet

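You can also combine conditions. In pandas this is done with `&` (and), `|` (or), and `~` (not), and each condition needs its own parentheses. The sketch below rebuilds the three-player DataFrame; the thresholds and variable names are only examples.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["LeBron James", "Stephen Curry", "Kevin Durant"],
    "Team": ["Lakers", "Warriors", "Suns"],
    "Points Per Game": [27.2, 24.6, 26.9],
})

# AND: more than 25 points per game and on the Lakers
both = df[(df['Points Per Game'] > 25) & (df['Team'] == 'Lakers')]

# OR: more than 27 points per game, or on the Warriors
either = df[(df['Points Per Game'] > 27) | (df['Team'] == 'Warriors')]

# NOT: every player who is not on the Lakers
not_lakers = df[~(df['Team'] == 'Lakers')]

print(both, '\n')
print(either, '\n')
print(not_lakers)
```
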
---
#### *Adding Data to a DataFrame*
You can add new columns or rows to a DataFrame

In [None]:
# Specify a new column and assign (=) values
df["Height (cm)"] = [206, 191, 208]
print(df)

In [None]:
# You can use .loc[] with a new row label to create a new row and assign (=) values
df.loc[3] = ['Giannis Antetokounmpo', 'Bucks', 29.9, 211]
df

You can use `pd.concat([df1, df2])` to combine two DataFrames

In [None]:
# Let's add new players to our DataFrame
new_players = {
    "Name": ["Joel Embiid", "Luka Dončić"],
    "Team": ["76ers", "Mavericks"],
    "Points Per Game": [30.6, 32.4],
    "Height (cm)": [213, 201]
}

# Create a DataFrame for your new players
df_new = pd.DataFrame(new_players)

# Combine both dataframes
df = pd.concat([df, df_new], ignore_index=True)
df

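To see what `ignore_index=True` does, compare the row labels with and without it. This sketch uses two small DataFrames (`df_a` and `df_b` are illustrative names) built from a few of the players above.

```python
import pandas as pd

df_a = pd.DataFrame({"Name": ["LeBron James", "Stephen Curry"],
                     "Team": ["Lakers", "Warriors"]})
df_b = pd.DataFrame({"Name": ["Joel Embiid"],
                     "Team": ["76ers"]})

# Without ignore_index, each DataFrame keeps its own row labels,
# so the label 0 appears twice
kept = pd.concat([df_a, df_b])

# With ignore_index=True, the rows are renumbered from 0
renumbered = pd.concat([df_a, df_b], ignore_index=True)

print(list(kept.index))        # [0, 1, 0]
print(list(renumbered.index))  # [0, 1, 2]
```
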

---
#### *Basic DataFrame Information*

| Method | Output |
|-|-|
|`len(df)`| # rows|
|`df.shape`|(# rows, # columns)|
|`df.count()`| Number of non-NA values in each column|


In [None]:
# Let's try them 
print('There are', len(df), 'players in my DataFrame')

# Create a variable
df_shape = df.shape
# df_shape[0] is number of rows, df_shape[1] is number of columns
print('I have', df_shape[1], 'pieces of information from', df_shape[0], 'players')

# Show how many values you have in each column
print(df.count())

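`len(df)` and `df.count()` only give different answers when some values are missing. The sketch below uses a small made-up DataFrame (`df_missing`, with one missing score) to show the difference.

```python
import pandas as pd

# Hypothetical data: the second player's score is missing
df_missing = pd.DataFrame({
    "Name": ["LeBron James", "Stephen Curry", "Kevin Durant"],
    "Points Per Game": [27.2, None, 26.9],
})

n_rows = len(df_missing)               # counts every row
values_per_column = df_missing.count() # counts only non-missing values

print('Rows:', n_rows)
print(values_per_column)
```
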
---
#### *Summary/Statistics*
You can get descriptive statistics from DataFrames. However, most only make sense for columns with integers or floats.

|Method|Output|
|-|-|
|`df.sum()`|Sum of values of each column|
|`df.cumsum()`|Cumulative sum of values|
|`df.min()` or `df.max()`|Minimum or maximum values|
|`df.describe()`|Summary statistics|
|`df.mean()`|Mean of values|
|`df.median()`|Median of values|



In [None]:
# You can get a summary of all columns or specify a column using brackets
print(df.sum())
print('\nTotal points:',df['Points Per Game'].sum())

# Try the rest
print('\ncumsum', df.cumsum())
print('\nThe lowest score was:', df['Points Per Game'].min())
print('\nThe mean Points Per Game is:', df['Points Per Game'].mean())

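Because the DataFrame mixes text and numbers, some statistics need a little care. Depending on your pandas version, calling `df.mean()` on a DataFrame with text columns may raise an error, so the sketch below passes `numeric_only=True` to skip them. It also tries `describe()` on a single column. The DataFrame is rebuilt here so the example runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["LeBron James", "Stephen Curry", "Kevin Durant"],
    "Team": ["Lakers", "Warriors", "Suns"],
    "Points Per Game": [27.2, 24.6, 26.9],
})

# Only the numeric columns are averaged; 'Name' and 'Team' are skipped
numeric_means = df.mean(numeric_only=True)
print(numeric_means)

# describe() returns count, mean, std, min, quartiles, and max
summary = df['Points Per Game'].describe()
print(summary)
```
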
### Exercise: Manipulating data using DataFrames
Use the following dictionary to create a DataFrame and answer the questions

In [126]:
basketball_stats = {
    "Player": [
        "John Smith", "Mike Johnson", "Alex Brown", "Chris Davis", "James Wilson",
        "Daniel Lee", "Ryan Harris", "Kevin White", "David Martin", "Brian Anderson"
    ],
    "Team": [
        "Lions", "Lions", "Lions", "Lions", "Lions",
        "Tigers", "Tigers", "Tigers", "Tigers", "Tigers"
    ],
    "Points": [25, 18, 12, 20, 22, 15, 10, 30, 27, 14],
    "Assists": [5, 7, 4, 6, 3, 8, 9, 2, 5, 6],
    "Height_cm": [198, 195, 200, 205, 190, 185, 192, 198, 203, 187],
    "Weight_kg": [95, 92, 100, 110, 88, 85, 90, 97, 105, 83],
}

In [None]:
# Create a Pandas DataFrame using the dictionary above
df = pd.DataFrame(basketball_stats)
df

1. How many players in your DataFrame are from the Tigers and how many are from the Lions?

In [None]:
# Code for 1

# Get a subset df for each team
df_tigers = df[df['Team'] == 'Tigers']
df_lions = df[df['Team'] == 'Lions']
print(df_tigers) # check that the subset is correct


# The number of rows in each DataFrame corresponds to the number of players
n_tigers = len(df_tigers)   # or df_tigers.shape[0]
n_lions = df_lions.shape[0] # or len(df_lions)


# Print your results
print('There are', n_tigers, 'players from the Tigers team and', n_lions, 'players from the Lions team')

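As an alternative that was not covered above, the Series method `value_counts()` counts how many times each value appears in a column, so you can get both team sizes in one step. The sketch only rebuilds the two columns it needs from the exercise dictionary.

```python
import pandas as pd

df = pd.DataFrame({
    "Player": [
        "John Smith", "Mike Johnson", "Alex Brown", "Chris Davis", "James Wilson",
        "Daniel Lee", "Ryan Harris", "Kevin White", "David Martin", "Brian Anderson"
    ],
    "Team": [
        "Lions", "Lions", "Lions", "Lions", "Lions",
        "Tigers", "Tigers", "Tigers", "Tigers", "Tigers"
    ],
})

# One row per team, with the number of players as the value
players_per_team = df['Team'].value_counts()

print('Tigers:', players_per_team['Tigers'])
print('Lions:', players_per_team['Lions'])
```
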
2. Who are the tallest and shortest players from the DataFrame?

**Hint:** use `min()` and `max()`, then use those values to find the matching rows

In [None]:
# Code for 2

# Find the maximum and minimum height values
max_height = df['Height_cm'].max()
min_height = df['Height_cm'].min()


# Find the row that contains the max/min height values
row_max_height = df[df['Height_cm'] == max_height]
row_min_height = df[df['Height_cm'] == min_height]

# Get the player's name
player_max_height = row_max_height['Player']  # The output will be a pandas series
player_min_height = row_min_height['Player']

# To print it, you can turn the pandas Series into a list using list() and then get the first item [0]
print('The tallest player is', list(player_max_height)[0], 'and the shortest player is', list(player_min_height)[0])

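Another way to answer this question, shown here only as an optional shortcut, is `idxmax()` and `idxmin()`. They return the row label of the largest or smallest value (the first one if there is a tie), which you can then pass to `.loc[]`.

```python
import pandas as pd

df = pd.DataFrame({
    "Player": [
        "John Smith", "Mike Johnson", "Alex Brown", "Chris Davis", "James Wilson",
        "Daniel Lee", "Ryan Harris", "Kevin White", "David Martin", "Brian Anderson"
    ],
    "Height_cm": [198, 195, 200, 205, 190, 185, 192, 198, 203, 187],
})

# Row labels of the tallest and shortest players
label_tallest = df['Height_cm'].idxmax()
label_shortest = df['Height_cm'].idxmin()

# Use the labels to look up the names
tallest = df.loc[label_tallest, 'Player']
shortest = df.loc[label_shortest, 'Player']

print('The tallest player is', tallest, 'and the shortest player is', shortest)
```
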
3. Which team accumulated the highest amount of points?

**Hint:** Use the `.sum()` method and determine which value is the highest

In [None]:
# Code for 3

# Get the total points for each team using the subset DataFrames
total_points_tigers = df_tigers['Points'].sum()
total_points_lions = df_lions['Points'].sum()

# (1) Print the totals, compare them by eye, and print the result
print(total_points_tigers, total_points_lions)
print('The Lions scored more points in total')

# (2) Use an if statement to compare the totals in code
if total_points_lions < total_points_tigers:
    print('The Tigers scored more points in total')
else:
    print('The Lions scored more points in total')

4. What is the mean weight for each team?

In [None]:
# Code for 4

# Find the mean for the weight columns in the subset DataFrames
mean_weight_tigers = df_tigers['Weight_kg'].mean()
mean_weight_lions = df_lions['Weight_kg'].mean()

# Print the results
print('The mean weight for the Tigers is:', mean_weight_tigers, 'kg')
print('The mean weight for the Lions is:', mean_weight_lions, 'kg')

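Questions 3 and 4 can also be answered without building a subset for each team. The sketch below uses `groupby()`, which was not covered in this section, to split the rows by team and calculate one result per team. Treat it as an optional alternative; only the columns needed are rebuilt here.

```python
import pandas as pd

df = pd.DataFrame({
    "Team": [
        "Lions", "Lions", "Lions", "Lions", "Lions",
        "Tigers", "Tigers", "Tigers", "Tigers", "Tigers"
    ],
    "Points": [25, 18, 12, 20, 22, 15, 10, 30, 27, 14],
    "Weight_kg": [95, 92, 100, 110, 88, 85, 90, 97, 105, 83],
})

# Total points for each team (question 3)
points_per_team = df.groupby('Team')['Points'].sum()
top_team = points_per_team.idxmax()  # label of the highest total

# Mean weight for each team (question 4)
mean_weight_per_team = df.groupby('Team')['Weight_kg'].mean()

print(points_per_team)
print('Team with the most points:', top_team)
print(mean_weight_per_team)
```
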
5. Which players have the most assists?

In [None]:
# Code for 5

# Find the maximum value for assists
max_assist = df['Assists'].max()

# Find the row with the maximum assist value
row_max_assist = df[df['Assists'] == max_assist]

# Get the player's name from the row
player_max_assist = row_max_assist['Player']

# print the results (turn series into list and get first element)
print('The player with the most assists is:', list(player_max_assist)[0])
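
Taking only the first element hides any ties. If two or more players shared the highest number of assists, you would probably want all of their names. A small sketch of that idea, using `.tolist()` to keep every matching name (with this data there is only one):

```python
import pandas as pd

df = pd.DataFrame({
    "Player": [
        "John Smith", "Mike Johnson", "Alex Brown", "Chris Davis", "James Wilson",
        "Daniel Lee", "Ryan Harris", "Kevin White", "David Martin", "Brian Anderson"
    ],
    "Assists": [5, 7, 4, 6, 3, 8, 9, 2, 5, 6],
})

max_assist = df['Assists'].max()

# Keep every player whose assists equal the maximum, not just the first one
top_assist_players = df[df['Assists'] == max_assist]['Player'].tolist()

print('Most assists (' + str(max_assist) + '):', ', '.join(top_assist_players))
```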