# Week 3: Data Manipulation with Pandas
### Introduction to Pandas
Pandas is a Python **library** for working with data in tabular (table-like) structures called **DataFrames**. A DataFrame is similar to an Excel spreadsheet: it is organized into **columns** and **rows**.
#### Key features in Pandas:
- **DataFrames**: store 2D data (rows and columns)
- **Series**: store 1D arrays
- **Data Manipulation**: you can easily add, remove, modify, and access data
- **Descriptive statistics**: you can easily calculate statistics such as mean, sums, counts, and more
---

### Importing a library
To access the objects in a library, you must first import it into your notebook using the `import` statement.
<br><br>
**Note**: you need to write the library's name every time you use one of its objects, so you can use `import library as abbreviation` to give the library a shorter alias.

In [107]:
# Import the library 'Pandas' and use the alias 'pd'
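
A minimal sketch of the import described above. The alias `pd` is a widely used convention rather than a requirement; any valid name would work.

```python
# Import pandas under the conventional alias 'pd'
import pandas as pd

# Every pandas object is now reached through the alias, e.g. pd.DataFrame
print(pd.__version__)
```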


### Working with DataFrames
---
#### *Creating a DataFrame*
To create a **Pandas DataFrame** you can use the class ```pandas.DataFrame(data=None, index=None, columns=None)``` from the pandas library
<br><br>
1. Create DataFrames using Python **lists**

In [None]:
# Let's see what happens when we use lists
my_list = ['one','two','three','four']

In [None]:
# Now, what would happen if we use a list of lists?
list_of_lists = [['one','two','three'],['a', 'b', 'c'],[1,2,3]]
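
One way to see the difference between the two inputs above is to pass each to `pd.DataFrame` and compare the shapes. The variable names `df_flat` and `df_nested` are chosen here only for illustration.

```python
import pandas as pd

my_list = ['one', 'two', 'three', 'four']
list_of_lists = [['one', 'two', 'three'], ['a', 'b', 'c'], [1, 2, 3]]

# A flat list becomes a single column with one row per element
df_flat = pd.DataFrame(my_list)
print(df_flat.shape)  # (4, 1)

# In a list of lists, each inner list becomes one row
df_nested = pd.DataFrame(list_of_lists)
print(df_nested.shape)  # (3, 3)
```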


The top row holds your **column** names. If not specified, pandas uses integer positions (0, 1, 2, ...).
<br>
The leftmost column holds your **index** (row) names. If not specified, pandas uses integer positions as well.

In [None]:
# Use the attributes .columns and .index to access the column and row labels


In [None]:
# We can define the column and index names by:
# (1) assigning to the 'columns' or 'index' attribute of your DataFrame


# (2) passing the column and index labels as arguments
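
The two approaches can be sketched as follows. The labels (`col_a`, `row_1`, and so on) and the variable names are made up for this example.

```python
import pandas as pd

data = [['one', 'two', 'three'], ['a', 'b', 'c'], [1, 2, 3]]

# Without labels, pandas numbers the columns and rows 0, 1, 2, ...
df = pd.DataFrame(data)
print(list(df.columns))  # [0, 1, 2]
print(list(df.index))    # [0, 1, 2]

# (1) Assign new labels to the existing DataFrame
df.columns = ['col_a', 'col_b', 'col_c']
df.index = ['row_1', 'row_2', 'row_3']

# (2) Or pass the labels as arguments when creating the DataFrame
df2 = pd.DataFrame(
    data,
    columns=['col_a', 'col_b', 'col_c'],
    index=['row_1', 'row_2', 'row_3'],
)

# Both routes give the same labels
print(list(df2.columns))
print(list(df2.index))
```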


In [None]:
# Let's see our new indexes and columns


In [None]:
# If the inner lists have different lengths, the DataFrame is still created and the missing cells are filled with NaN or None
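
A small sketch of this padding behaviour, using a made-up list of numbers in which the second row is one element short:

```python
import pandas as pd

# The second inner list is one element shorter than the first
ragged = [[1, 2, 3], [4, 5]]
df_ragged = pd.DataFrame(ragged)

# The missing cell in the last column is filled with NaN
print(df_ragged)

# Count the missing cells across the whole DataFrame
print(df_ragged.isna().sum().sum())  # 1
```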


In [None]:
# If the number of column or index labels does not match the number of elements in the lists, an error is raised
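
The mismatch can be reproduced with a short, hypothetical example. Here the error is caught with `try`/`except` so the notebook keeps running; the exact wording of the message may differ between pandas versions.

```python
import pandas as pd

rows = [[1, 2, 3], [4, 5, 6]]  # each row has three elements

try:
    # Only two column labels are given for three-element rows
    pd.DataFrame(rows, columns=['a', 'b'])
    error_raised = False
except ValueError as err:
    error_raised = True
    print(f"ValueError: {err}")
```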


2. Create DataFrames with Python **dictionaries**

In [None]:
# Create a dictionary with lists
dict1 = {
    'names' : ['Alice', 'Bob', 'Charlie'],
    'ages' : [25, 30, 35],
    'nationalities' : ['American', 'British', 'Australian'],
    'sports' : ['Tennis', 'Soccer', 'Basketball']
}
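
Passing the dictionary to `pd.DataFrame` might look like the sketch below: each key becomes a column name and each list becomes that column's values. The name `df_people` is an assumption made for this example.

```python
import pandas as pd

dict1 = {
    'names': ['Alice', 'Bob', 'Charlie'],
    'ages': [25, 30, 35],
    'nationalities': ['American', 'British', 'Australian'],
    'sports': ['Tennis', 'Soccer', 'Basketball'],
}

# Keys become column names; list elements become the rows
df_people = pd.DataFrame(dict1)

print(list(df_people.columns))
print(df_people.shape)  # (3, 4)
```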



In [None]:
# If using dictionaries, all the lists must have the same length or an error will occur
dict2 = {
    'names' : ['Alice', 'Bob', 'Charlie'],
    'ages' : [25, 30, 35],
    'nationalities' : ['American', 'British', 'Australian'],
    'sports' : ['Tennis', 'Soccer']
}
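
A sketch of what happens with `dict2`, where the `'sports'` list is one element short. The error is caught here only so that the rest of the notebook can continue to run.

```python
import pandas as pd

dict2 = {
    'names': ['Alice', 'Bob', 'Charlie'],
    'ages': [25, 30, 35],
    'nationalities': ['American', 'British', 'Australian'],
    'sports': ['Tennis', 'Soccer'],  # only two values
}

try:
    pd.DataFrame(dict2)
    dict_error = False
except ValueError as err:
    # pandas refuses to build a DataFrame from lists of unequal length
    dict_error = True
    print(f"ValueError: {err}")
```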


Now, let's create a DataFrame to learn how to extract and manipulate data

In [None]:
# Let's create a DataFrame using a dictionary
player_data = {
    "Name": ["LeBron James", "Stephen Curry", "Kevin Durant"],
    "Team": ["Lakers", "Warriors", "Suns"],
    "Points Per Game": [27.2, 24.6, 26.9],
}


---
#### *Accessing Data from a DataFrame*
1. Access specific columns using `[]`
2. Access specific rows using `.iloc[i]` (by position) or `.loc[label]` (by label)

In [None]:
# Using [] after your DataFrame will focus on columns


# Using .iloc[] you can access a row by its integer position


# Use .loc[] when your rows have specific labels
# Let's use df2
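
The three access styles can be sketched with the player data from above. The labelled DataFrame here is called `df_labelled` and its row labels (`p1`, `p2`, `p3`) are invented for this example; they stand in for whatever labelled DataFrame you created earlier.

```python
import pandas as pd

player_data = {
    "Name": ["LeBron James", "Stephen Curry", "Kevin Durant"],
    "Team": ["Lakers", "Warriors", "Suns"],
    "Points Per Game": [27.2, 24.6, 26.9],
}
df = pd.DataFrame(player_data)

# [] selects one column (a Series); a list of names selects several columns
teams = df["Team"]
subset = df[["Name", "Points Per Game"]]

# .iloc[] selects a row by its integer position
first_row = df.iloc[0]
print(first_row["Name"])  # LeBron James

# .loc[] selects a row by its label
df_labelled = pd.DataFrame(player_data, index=["p1", "p2", "p3"])
second_row = df_labelled.loc["p2"]
print(second_row["Name"])  # Stephen Curry
```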


3. Access a specific cell using `DataFrame[Column][Index]`

In [None]:
# First specify the column, then specify the row index



# It will also work with row labels
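
A short sketch of single-cell access, again with invented row labels. The last line shows `.loc[row, column]`, which reaches the same cell in one step and is the form pandas generally recommends when you want to assign a value.

```python
import pandas as pd

player_data = {
    "Name": ["LeBron James", "Stephen Curry", "Kevin Durant"],
    "Team": ["Lakers", "Warriors", "Suns"],
    "Points Per Game": [27.2, 24.6, 26.9],
}
df = pd.DataFrame(player_data)
df_labelled = pd.DataFrame(player_data, index=["p1", "p2", "p3"])

# Column first, then the row index
ppg_first = df["Points Per Game"][0]
print(ppg_first)  # 27.2

# The same pattern works with row labels
team_p3 = df_labelled["Team"]["p3"]
print(team_p3)  # Suns

# Equivalent one-step lookup with .loc[row, column]
ppg_first_loc = df.loc[0, "Points Per Game"]
```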



4. Access a subset of a DataFrame using conditions
You can get a subset of your DataFrame by placing a condition inside the brackets: `DataFrame[condition]`

In [None]:
# Example: let's get a subset of our DataFrame for the players that scored more than 25 points


# Example: get the players that are from the Bucks team
# In Python, == tests whether two values are equal
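
Conditional filtering might look like this with the three-player data defined earlier. Note that this sample contains no Bucks player, so that filter returns an empty DataFrame; a Lakers filter is included for comparison.

```python
import pandas as pd

player_data = {
    "Name": ["LeBron James", "Stephen Curry", "Kevin Durant"],
    "Team": ["Lakers", "Warriors", "Suns"],
    "Points Per Game": [27.2, 24.6, 26.9],
}
df = pd.DataFrame(player_data)

# The condition produces one True/False value per row
mask = df["Points Per Game"] > 25

# Only the rows where the condition is True are kept
high_scorers = df[mask]
print(list(high_scorers["Name"]))  # ['LeBron James', 'Kevin Durant']

# == compares each value in the column with the given team name
lakers = df[df["Team"] == "Lakers"]
bucks = df[df["Team"] == "Bucks"]  # no match in this sample, so 0 rows
print(len(lakers), len(bucks))  # 1 0
```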


---
#### *Adding Data to a DataFrame*
You can add new columns or rows to a DataFrame

In [None]:
# Specify a new column and assign (=) values
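
A minimal sketch of adding columns. The column names `League` and `High Scorer` are invented for this example. A list of values also works, as long as it has one element per row.

```python
import pandas as pd

player_data = {
    "Name": ["LeBron James", "Stephen Curry", "Kevin Durant"],
    "Team": ["Lakers", "Warriors", "Suns"],
    "Points Per Game": [27.2, 24.6, 26.9],
}
df = pd.DataFrame(player_data)

# A single value is repeated for every row
df["League"] = "NBA"

# A new column can also be computed from an existing one
df["High Scorer"] = df["Points Per Game"] > 25

print(df.shape)  # (3, 5)
```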


In [None]:
# You can use .loc[] to create a new row and assign (=) values
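
Adding a row with `.loc[]` could be sketched like this. The new row holds placeholder values, not real statistics, and `len(df)` is used as the new label on the assumption that the index is the default 0, 1, 2, ...

```python
import pandas as pd

player_data = {
    "Name": ["LeBron James", "Stephen Curry", "Kevin Durant"],
    "Team": ["Lakers", "Warriors", "Suns"],
    "Points Per Game": [27.2, 24.6, 26.9],
}
df = pd.DataFrame(player_data)

# Assigning to a label that does not exist yet creates a new row.
# The values are placeholders and must follow the column order.
df.loc[len(df)] = ["Example Player", "Example Team", 0.0]

print(len(df))  # 4
```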


You can use `pd.concat([df1, df2])` to combine two DataFrames

In [None]:
# Let's add new players to our DataFrame
new_players = {
    "Name": ["Joel Embiid", "Luka Dončić"],
    "Team": ["76ers", "Mavericks"],
    "Points Per Game": [30.6, 32.4],
    "Height (cm)": [213, 201]
}

# Create a DataFrame for your new players


# Combine both DataFrames
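
A possible way to combine the two DataFrames is sketched below. In this self-contained version the original players have no `Height (cm)` column, so `pd.concat` fills those cells with NaN. `ignore_index=True` is an optional choice made here so that the rows are renumbered 0 to 4.

```python
import pandas as pd

player_data = {
    "Name": ["LeBron James", "Stephen Curry", "Kevin Durant"],
    "Team": ["Lakers", "Warriors", "Suns"],
    "Points Per Game": [27.2, 24.6, 26.9],
}
new_players = {
    "Name": ["Joel Embiid", "Luka Dončić"],
    "Team": ["76ers", "Mavericks"],
    "Points Per Game": [30.6, 32.4],
    "Height (cm)": [213, 201],
}

df = pd.DataFrame(player_data)
new_df = pd.DataFrame(new_players)

# Stack the rows of both DataFrames; columns are matched by name
combined = pd.concat([df, new_df], ignore_index=True)

print(combined.shape)  # (5, 4)
print(combined["Height (cm)"].isna().sum())  # 3
```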



---
#### *Basic DataFrame Information*

| Method | Output |
|-|-|
|`len(df)`|Number of rows|
|`df.shape`|(number of rows, number of columns)|
|`df.count()`|Number of non-NA values in each column|


In [None]:
# Let's try them 
# Use len() for # of rows

# Store the (# rows, # columns) tuple from .shape in a variable


# Use .count() to show how many non-NA values each column has
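
The three tools in the table can be tried on the player data as in this sketch. The variable names are illustrative.

```python
import pandas as pd

player_data = {
    "Name": ["LeBron James", "Stephen Curry", "Kevin Durant"],
    "Team": ["Lakers", "Warriors", "Suns"],
    "Points Per Game": [27.2, 24.6, 26.9],
}
df = pd.DataFrame(player_data)

# Number of rows
n_rows = len(df)
print(n_rows)  # 3

# .shape is an attribute (no parentheses) holding (rows, columns)
df_shape = df.shape
print(df_shape)  # (3, 3)

# Non-NA values per column, returned as a Series
counts = df.count()
print(counts)
```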


---
#### *Summary/Statistics*
You can get descriptive statistics from DataFrames. However, most only make sense for columns that hold integers or floats.

|Method|Output|
|-|-|
|`df.sum()`|Sum of values of each column|
|`df.cumsum()`|Cumulative sum of values|
|`df.min()` or `df.max()`|Minimum or maximum values|
|`df.describe()`|Summary statistics|
|`df.mean()`|Mean of values|
|`df.median()`|Median of values|



In [None]:
# You can get a summary of all columns or specify a column using brackets


# Try the rest
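
A brief sketch of the statistics methods on the player data. Because `Name` and `Team` hold text, `numeric_only=True` is passed to `df.mean()` here so that pandas skips them; recent pandas versions may raise an error without it.

```python
import pandas as pd

player_data = {
    "Name": ["LeBron James", "Stephen Curry", "Kevin Durant"],
    "Team": ["Lakers", "Warriors", "Suns"],
    "Points Per Game": [27.2, 24.6, 26.9],
}
df = pd.DataFrame(player_data)

# Summary of the numeric columns (count, mean, std, min, quartiles, max)
print(df.describe())

# Statistics for one column selected with brackets
mean_ppg = df["Points Per Game"].mean()      # roughly 26.23
max_ppg = df["Points Per Game"].max()        # 27.2
median_ppg = df["Points Per Game"].median()  # 26.9
running_total = df["Points Per Game"].cumsum()

# Mean of every numeric column, skipping the text columns
print(df.mean(numeric_only=True))
```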


### Exercise: Manipulating data using DataFrames
Use the following dictionary to create a DataFrame and answer the questions

In [126]:
basketball_stats = {
    "Player": [
        "John Smith", "Mike Johnson", "Alex Brown", "Chris Davis", "James Wilson",
        "Daniel Lee", "Ryan Harris", "Kevin White", "David Martin", "Brian Anderson"
    ],
    "Team": [
        "Lions", "Lions", "Lions", "Lions", "Lions",
        "Tigers", "Tigers", "Tigers", "Tigers", "Tigers"
    ],
    "Points": [25, 18, 12, 20, 22, 15, 10, 30, 27, 14],
    "Assists": [5, 7, 4, 6, 3, 8, 9, 2, 5, 6],
    "Height_cm": [198, 195, 200, 205, 190, 185, 192, 198, 203, 187],
    "Weight_kg": [95, 92, 100, 110, 88, 85, 90, 97, 105, 83],
}

In [None]:
# Create a Pandas DataFrame using the dictionary above


1. How many players in your DataFrame are from the Tigers and how many are from the Lions?

In [None]:
# Code for 1


2. Who are the tallest and shortest players from the DataFrame?

**Hint:** use `min()` and `max()`, then use the resulting values to find the matching rows

In [None]:
# Code for 2


3. Which team accumulated the most points?

**Hint:** Use the `.sum()` method and determine which value is the highest

In [None]:
# Code for 3


4. What is the mean weight for each team?

In [None]:
# Code for 4


5. Which players have the most assists?

In [None]:
# Code for 5
