## 1. Import Pandas & create DataFrame or Series

There are a few ways to create a `DataFrame` or `Series`:

### 1. From a Dictionary `{key : value}` 
Good for when your data is organized by columns.

### 2. From a List of Lists `[[0, 1], [0, 1]]` 
Good for when your data is organized by rows.

### 3. From a File 
The most common way in practice, using `pd.read_csv()`. 
    
* The `delimiter='\t'` argument is used for files that don't use commas to separate values (like Tab-Separated Value or `.tsv` files).
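
To try `pd.read_csv()` without a file on disk, here is a minimal sketch that reads from in-memory text through `io.StringIO`. The sample text and variable names are made up for illustration. It uses `sep`, the parameter for which `delimiter` is an alias.

```python
import io
import pandas as pd

# Hypothetical sample text standing in for the contents of a .csv and a .tsv file
csv_text = "city,country,population\nBrooklyn,US,2646000\nSeoul,South Korea,9411000\n"
tsv_text = "city\tcountry\tpopulation\nBrooklyn\tUS\t2646000\nSeoul\tSouth Korea\t9411000\n"

# Comma-separated text needs no extra arguments
df_from_csv = pd.read_csv(io.StringIO(csv_text))

# Tab-separated text needs the separator spelled out
df_from_tsv = pd.read_csv(io.StringIO(tsv_text), sep='\t')

print(df_from_csv)
print(df_from_tsv)
```

Both calls should produce the same three-column table, since only the separator differs.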

In [9]:
import pandas as pd

# Method 1: From a dictionary
data = {
  'city': ['Brooklyn', 'Seoul', 'Barcelona', 'Mexico City'],
  'country': ['US', 'South Korea', 'Spain', 'Mexico'],
  'population': [2646000, 9411000, 1636000, 9209944]
}
df_from_dict = pd.DataFrame(data)


# Method 2: From a list of lists
data = [
  ['Brooklyn', 'US', 2646000],
  ['Seoul', 'South Korea', 9411000],
  ['Barcelona', 'Spain', 1636000],
  ['Mexico City', 'Mexico', 9209944]
]
df_from_list = pd.DataFrame(data, columns=['city', 'country', 'population'])


# Method 3: From a CSV file (examples)
# df_from_csv = pd.read_csv('my_filename.csv')
# df_from_tsv = pd.read_csv('my_filename.tsv', delimiter='\t')

df_from_dict

Unnamed: 0,city,country,population
0,Brooklyn,US,2646000
1,Seoul,South Korea,9411000
2,Barcelona,Spain,1636000
3,Mexico City,Mexico,9209944
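
The cell above only builds DataFrames. As a rough sketch of the `Series` side of this section, a `Series` can be created in much the same way from a list or a dictionary. The variable names here are illustrative only.

```python
import pandas as pd

# From a list: pandas assigns a default integer index (0, 1, 2, ...)
cities = pd.Series(['Brooklyn', 'Seoul', 'Barcelona'], name='city')

# From a dictionary: the keys become the index labels
population = pd.Series(
    {'Brooklyn': 2646000, 'Seoul': 9411000, 'Barcelona': 1636000},
    name='population',
)

# A single DataFrame column is also a Series
df = pd.DataFrame({'city': ['Brooklyn', 'Seoul'], 'population': [2646000, 9411000]})
population_column = df['population']

print(cities)
print(population)
print(type(population_column))
```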


## 2. Inspecting the Data

### 1. `.head()`
This method shows you the first rows of your `DataFrame` (5 by default). It's perfect for getting a quick snapshot of your data to make sure it loaded correctly and to see what your columns look like.

### 2. `.tail()`
This method does the opposite: it shows you the last rows of your `DataFrame` (5 by default). This is useful for checking the end of a dataset, especially with time-series data where the last rows are the most recent.
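
Both methods accept an optional number of rows. A small sketch, using a made-up ten-row DataFrame:

```python
import pandas as pd

# Hypothetical ten-row DataFrame used only for this illustration
numbers = pd.DataFrame({'n': range(10)})

first_three = numbers.head(3)  # first three rows
last_two = numbers.tail(2)     # last two rows

print(first_three)
print(last_two)
```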


### 3. `apps.head` vs. `apps.head()`

Why do `apps.head` and `apps.tail` give a "strange output"?

* Analogy: Think of it like a remote control. `apps.head` is like pointing to the "Play" button. You're just referencing the button itself. The output you'll see, `<bound method NDFrame.head of ...>`, is Python telling you, "This is the `head` method, bound to this DataFrame object." You haven't pressed it yet. (The class name shown before `.head` can differ between pandas versions.)

  * `apps.head()` is like actually pressing the "Play" button. The parentheses `()` execute or call the method, telling it to run and do its job, which is to show you the first 5 rows of the DataFrame.
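
The difference can be checked directly. This is a minimal sketch with a shortened, illustrative version of the `apps` data:

```python
import pandas as pd

apps = pd.DataFrame({'app_name': ['YouTube', 'TikTok', 'Instagram']})

method_reference = apps.head  # no parentheses: just a reference to the method
result = apps.head()          # parentheses: the method runs and returns a DataFrame

print(callable(method_reference))        # True: it is something you can call
print(isinstance(result, pd.DataFrame))  # True: calling it produced a DataFrame
```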

In [10]:
# Popular mobile apps
app_data = {
  'app_name': ['YouTube', 'TikTok', 'Instagram', 'Spotify', 'Duolingo', 'Twitter', 'Headspace', 'Discord', 'Depop'],
  'category': ['Video', 'Social Media', 'Social Media', 'Music', 'Education', 'Social Media', 'Health', 'Communication', 'Shopping'],
  'rating': [4.7, 4.6, 4.5, 4.6, 4.7, 4.3, None, 4.7, 4.4],
  'downloads_millions': [5000, 3000, 3500, 2000, None, 1500, 500, 600, 200]
}

# Create the DataFrame
apps = pd.DataFrame(app_data)
apps

Unnamed: 0,app_name,category,rating,downloads_millions
0,YouTube,Video,4.7,5000.0
1,TikTok,Social Media,4.6,3000.0
2,Instagram,Social Media,4.5,3500.0
3,Spotify,Music,4.6,2000.0
4,Duolingo,Education,4.7,
5,Twitter,Social Media,4.3,1500.0
6,Headspace,Health,,500.0
7,Discord,Communication,4.7,600.0
8,Depop,Shopping,4.4,200.0


In [11]:
# This calls the method and shows the first 5 rows
apps.head()

Unnamed: 0,app_name,category,rating,downloads_millions
0,YouTube,Video,4.7,5000.0
1,TikTok,Social Media,4.6,3000.0
2,Instagram,Social Media,4.5,3500.0
3,Spotify,Music,4.6,2000.0
4,Duolingo,Education,4.7,


In [12]:
apps.head

<bound method NDFrame.head of     app_name       category  rating  downloads_millions
0    YouTube          Video     4.7              5000.0
1     TikTok   Social Media     4.6              3000.0
2  Instagram   Social Media     4.5              3500.0
3    Spotify          Music     4.6              2000.0
4   Duolingo      Education     4.7                 NaN
5    Twitter   Social Media     4.3              1500.0
6  Headspace         Health     NaN               500.0
7    Discord  Communication     4.7               600.0
8      Depop       Shopping     4.4               200.0>

In [13]:
# This calls the method and shows the last 5 rows
apps.tail()

Unnamed: 0,app_name,category,rating,downloads_millions
4,Duolingo,Education,4.7,
5,Twitter,Social Media,4.3,1500.0
6,Headspace,Health,,500.0
7,Discord,Communication,4.7,600.0
8,Depop,Shopping,4.4,200.0


In [14]:
apps.tail

<bound method NDFrame.tail of     app_name       category  rating  downloads_millions
0    YouTube          Video     4.7              5000.0
1     TikTok   Social Media     4.6              3000.0
2  Instagram   Social Media     4.5              3500.0
3    Spotify          Music     4.6              2000.0
4   Duolingo      Education     4.7                 NaN
5    Twitter   Social Media     4.3              1500.0
6  Headspace         Health     NaN               500.0
7    Discord  Communication     4.7               600.0
8      Depop       Shopping     4.4               200.0>

## 3. Getting More Info

### 1. `.info()`
Gives you a technical summary. It tells you the data type of each column (`object` usually means string) and how many non-null (i.e., not empty) values each column has. This is great for spotting missing data at a glance.

There are a few notable pieces of information in this output:

* 9 `entries` means that there are 9 rows in the dataset.

* The `rating` and `downloads_millions` columns are each missing 1 value (only 8 non-null).

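* The `Dtype` column describes the data type of each column.

  * Decimal numbers are stored as `float64` and whole numbers as `int64`. No `int64` columns appear here: `downloads_millions` holds whole numbers, but its missing value is stored as `NaN`, which is a float, so the whole column becomes `float64`.

  * Columns that store strings are represented by `object`. Columns that store other Python objects, such as dictionaries, user-defined objects, or dates that haven't been converted to a datetime type, also appear as `object`.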
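
The same facts that `.info()` prints can also be pulled out as values. A minimal sketch, assuming a shortened version of the `apps` data:

```python
import pandas as pd

apps = pd.DataFrame({
    'app_name': ['YouTube', 'Duolingo', 'Headspace'],
    'rating': [4.7, 4.7, None],
    'downloads_millions': [5000, None, 500],
})

# Data type of each column, as a Series indexed by column name
column_types = apps.dtypes

# Number of missing values per column
missing_counts = apps.isna().sum()

print(column_types)
print(missing_counts)
```

Unlike `.info()`, which only prints, `.dtypes` and `.isna().sum()` return objects you can use in later code.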



### 2. `.describe()`
Gives you a statistical summary of the numerical columns (`count`, `mean`, `std` for standard deviation, `min`, the quartiles, and `max`). Passing `include='all'` also describes the non-numerical columns, adding `unique` (the number of distinct values), `top` (the most frequent value), and `freq` (how often that value occurs).
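
The result of `.describe()` is itself a `DataFrame`, so individual statistics can be looked up by label. A short sketch with a made-up four-row sample:

```python
import pandas as pd

apps = pd.DataFrame({
    'category': ['Video', 'Social Media', 'Social Media', 'Music'],
    'rating': [4.7, 4.6, 4.5, 4.6],
})

summary = apps.describe()                    # numerical columns only
full_summary = apps.describe(include='all')  # every column

# Look up single statistics by row label and column name
mean_rating = summary.loc['mean', 'rating']
top_category = full_summary.loc['top', 'category']

print(mean_rating)
print(top_category)
```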



In [15]:
# Technical summary
apps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   app_name            9 non-null      object 
 1   category            9 non-null      object 
 2   rating              8 non-null      float64
 3   downloads_millions  8 non-null      float64
dtypes: float64(2), object(2)
memory usage: 420.0+ bytes


In [16]:
# Statistical summary of numerical columns
apps.describe()

Unnamed: 0,rating,downloads_millions
count,8.0,8.0
mean,4.5625,2037.5
std,0.150594,1687.71824
min,4.3,200.0
25%,4.475,575.0
50%,4.6,1750.0
75%,4.7,3125.0
max,4.7,5000.0


In [17]:
# Statistical summary of ALL columns
apps.describe(include='all')

Unnamed: 0,app_name,category,rating,downloads_millions
count,9,9,8.0,8.0
unique,9,7,,
top,YouTube,Social Media,,
freq,1,3,,
mean,,,4.5625,2037.5
std,,,0.150594,1687.71824
min,,,4.3,200.0
25%,,,4.475,575.0
50%,,,4.6,1750.0
75%,,,4.7,3125.0
max,,,4.7,5000.0


## 4. Selecting Column(s)

### 1. Column selection
This is the foundation of data manipulation with Pandas.

* Ex: If we wanted only the `name` column, we could access it using `characters['name']` or `characters.name`.

    * The bracket style (`df['column_name']`) is slightly more common because it's more versatile. For example, `df.column name` doesn't work when the column name contains a space; it has to be written as `df['column name']`. The same goes for names that clash with Python keywords, such as the `class` column in the dataset below.
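
A small sketch of where dot notation works and where brackets are required. The `hit points` column is hypothetical and exists only to show a name with a space.

```python
import pandas as pd

# Hypothetical DataFrame with two "awkward" column names
df = pd.DataFrame({
    'level': [5, 3],
    'class': ['Ranger', 'Cleric'],
    'hit points': [42, 28],
})

levels_dot = df.level         # works: 'level' is a valid Python identifier
levels_bracket = df['level']  # same result

# Bracket notation is required for these two columns:
hit_points = df['hit points']  # the name contains a space
classes = df['class']          # `class` is a reserved word, so df.class is a SyntaxError

print(levels_dot.equals(levels_bracket))
```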

### 2. Multiple column selection

This allows you to look at two or more variables at the same time.

* Ex: We can access multiple columns by using a Python `list` of column names. For example, the following line of code would return the `name` and `level` columns:

    * `only_name_and_level_df = characters[['name', 'level']]`

### 3. `.drop()`

This allows for the exclusion of a column.

* Ex: Sometimes we want everything except one column. That’s where `.drop()` comes in handy:

    * `removed_alignment_df = characters.drop("alignment", axis = 1)`

    The above line of code creates a new `DataFrame` named `removed_alignment_df` that contains all of the columns from `characters` except for the `alignment` column. `axis = 1` tells Pandas that we want to drop a column rather than a row.
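
As an alternative to `axis=1`, `.drop()` also accepts a `columns=` keyword, which some find easier to read. A minimal sketch on a shortened version of the data:

```python
import pandas as pd

characters = pd.DataFrame({
    'name': ['Thorne', 'Elira'],
    'level': [5, 3],
    'alignment': ['Chaotic Good', 'Lawful Good'],
})

# Equivalent to characters.drop('alignment', axis=1)
without_alignment = characters.drop(columns='alignment')

# Several columns can be dropped at once by passing a list
only_name = characters.drop(columns=['level', 'alignment'])

print(list(without_alignment.columns))
print(list(only_name.columns))
print(list(characters.columns))  # the original DataFrame is unchanged
```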

In [18]:
import pandas as pd

# D&D characters data
characters_data = {
  'name': ['Thorne', 'Elira', 'Glim', 'Brug', 'Nyx', 'Kael', 'Mira', 'Drogan', 'Zara', 'Fenwick'],
  'race': ['Elf', 'Human', 'Gnome', 'Half-Orc', 'Tiefling', 'Dragonborn', 'Halfling', 'Dwarf', 'Aasimar', 'Goblin'],
  'class': ['Ranger', 'Cleric', 'Wizard', 'Barbarian', 'Rogue', 'Paladin', 'Bard', 'Fighter', 'Sorcerer', 'Warlock'],
  'level': [5, 3, 4, 2, 6, 7, 3, 5, 4, 2],
  'hp': [42, 28, 33, 25, 48, 56, 30, 44, 36, 24],
  'alignment': ['Chaotic Good', 'Lawful Good', 'Neutral', 'Chaotic Neutral', 'Chaotic Evil', 'Lawful Neutral', 'Neutral Good', 'Neutral', 'Chaotic Good', 'Lawful Evil']
}

# Create the DataFrame
characters = pd.DataFrame(characters_data)
characters

Unnamed: 0,name,race,class,level,hp,alignment
0,Thorne,Elf,Ranger,5,42,Chaotic Good
1,Elira,Human,Cleric,3,28,Lawful Good
2,Glim,Gnome,Wizard,4,33,Neutral
3,Brug,Half-Orc,Barbarian,2,25,Chaotic Neutral
4,Nyx,Tiefling,Rogue,6,48,Chaotic Evil
5,Kael,Dragonborn,Paladin,7,56,Lawful Neutral
6,Mira,Halfling,Bard,3,30,Neutral Good
7,Drogan,Dwarf,Fighter,5,44,Neutral
8,Zara,Aasimar,Sorcerer,4,36,Chaotic Good
9,Fenwick,Goblin,Warlock,2,24,Lawful Evil


In [19]:
# This selects the 'name' column and returns it as a pandas Series
character_names = characters['name']
character_names

0     Thorne
1      Elira
2       Glim
3       Brug
4        Nyx
5       Kael
6       Mira
7     Drogan
8       Zara
9    Fenwick
Name: name, dtype: object

In [20]:
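# Without a list, pandas looks for ONE column labeled ('name', 'level', 'hp') and raises a KeyError
basic_stats = characters['name', 'level', 'hp']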
basic_stats

KeyError: ('name', 'level', 'hp')

In [21]:
# Pass a list of column names inside the brackets (note the double brackets)
basic_stats = characters[['name', 'level', 'hp']]
basic_stats

Unnamed: 0,name,level,hp
0,Thorne,5,42
1,Elira,3,28
2,Glim,4,33
3,Brug,2,25
4,Nyx,6,48
5,Kael,7,56
6,Mira,3,30
7,Drogan,5,44
8,Zara,4,36
9,Fenwick,2,24


In [22]:
# Without `axis=1`, .drop() looks for a ROW labeled 'alignment' and raises a KeyError
removed_alignment = characters.drop('alignment')
removed_alignment

KeyError: "['alignment'] not found in axis"

In [23]:
# Pass `axis=1` so pandas drops a column instead of a row
removed_alignment = characters.drop('alignment', axis=1)
removed_alignment

Unnamed: 0,name,race,class,level,hp
0,Thorne,Elf,Ranger,5,42
1,Elira,Human,Cleric,3,28
2,Glim,Gnome,Wizard,4,33
3,Brug,Half-Orc,Barbarian,2,25
4,Nyx,Tiefling,Rogue,6,48
5,Kael,Dragonborn,Paladin,7,56
6,Mira,Halfling,Bard,3,30
7,Drogan,Dwarf,Fighter,5,44
8,Zara,Aasimar,Sorcerer,4,36
9,Fenwick,Goblin,Warlock,2,24


## 5. Filtering & Subsetting

### 1. Filtering with a Single Condition 
This is the most basic form of filtering, where you test for one thing.

* You create a single `True`/`False` test and use it to select rows from your `DataFrame`.

    * Ex: Find High-Level Characters
      
      1. Create the mask: The code `characters['level'] > 5` goes row by row and asks, "Is the value in the `level` column greater than 5?" It generates a pandas `Series` of `True` and `False` answers.
      
      2. Apply the mask: Placing that mask (`high_level` in the code below) inside `characters[...]` tells the `DataFrame`: "Give me a new table containing only the rows where the mask is `True`."
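
The two steps can be inspected separately. A minimal sketch on a four-row subset of the characters:

```python
import pandas as pd

characters = pd.DataFrame({
    'name': ['Thorne', 'Nyx', 'Kael', 'Mira'],
    'level': [5, 6, 7, 3],
})

# Step 1: the mask is a Series of True/False values, one per row
high_level = characters['level'] > 5
print(high_level.tolist())

# Step 2: indexing with the mask keeps only the rows marked True
high_level_characters = characters[high_level]
print(high_level_characters)

# Because True counts as 1, summing the mask counts the matching rows
match_count = high_level.sum()
print(match_count)
```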

### 2. Filtering with Multiple Conditions `&` (AND) 
You use the `&` (AND) operator when you need rows to satisfy all of your conditions at the same time.

* Think of `&` as "and also." A row will only be selected if `Condition A` is true AND `Condition B` is also true. If either one is `False`, the row is ignored.
    
    * Ex: Find Halfling Bards
      
      1. Is the `race` column equal to `'Halfling'`? (`True` or `False`)
      
      2. Is the `class` column equal to `'Bard'`? (`True` or `False`)
    
    A row is only included in the final `halfling_bards` `DataFrame` if the answer to both questions is `True`.
    
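    **Important:** Each condition must be wrapped in its own set of parentheses `()`. In Python, `&` and `|` bind more tightly than comparison operators such as `==` and `>`, so without the parentheses Python would try to evaluate `'Halfling' & characters['class']` first.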
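
One way to sidestep the parentheses issue is to store each condition in its own variable first. A sketch under that approach, where the character `Pip` is invented so that one Halfling is not a Bard:

```python
import pandas as pd

characters = pd.DataFrame({
    'name': ['Mira', 'Elira', 'Pip'],  # 'Pip' is a made-up extra row
    'race': ['Halfling', 'Human', 'Halfling'],
    'class': ['Bard', 'Cleric', 'Rogue'],
})

# Each condition becomes its own named mask
is_halfling = characters['race'] == 'Halfling'
is_bard = characters['class'] == 'Bard'

# & keeps a row only when both masks are True for it
halfling_bards = characters[is_halfling & is_bard]
print(halfling_bards)
```

Named masks also make longer filters easier to read and reuse.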

### 3. Filtering with Multiple Conditions `|` (OR) 
You use the `|` (OR) operator when a row only needs to satisfy at least one of your conditions.

* Think of `|` as "or maybe." A row will be selected if `Condition A` is `True`, OR `Condition B` is `True`, OR `Condition C` is `True`, etc. As long as a row meets at least one test, it makes the cut.
  
    * Ex: Find All Magic Users

        * Is the `class` `'Wizard'`?
        
        * If not, is the `class` `'Sorcerer'`?
        
        * If not, is the `class` `'Warlock'`?
    
    * If the answer to any of those questions is `True`, the row is included in the final `magic_users` `DataFrame`. Just like with `&`, each condition must be wrapped in parentheses `()`.
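
When every condition tests the same column against a different value, `.isin()` is a more compact alternative to chaining `|`. A minimal sketch showing that both give the same rows:

```python
import pandas as pd

characters = pd.DataFrame({
    'name': ['Glim', 'Brug', 'Zara', 'Fenwick'],
    'class': ['Wizard', 'Barbarian', 'Sorcerer', 'Warlock'],
})

# Chained OR conditions, as described above
with_or = characters[
    (characters['class'] == 'Wizard') |
    (characters['class'] == 'Sorcerer') |
    (characters['class'] == 'Warlock')
]

# Same filter with .isin() and a list of accepted values
magic_classes = ['Wizard', 'Sorcerer', 'Warlock']
magic_users = characters[characters['class'].isin(magic_classes)]

print(magic_users.equals(with_or))
```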

In [24]:
# Filter for characters with a level greater than 5
high_level = characters['level'] > 5
characters[high_level]

Unnamed: 0,name,race,class,level,hp,alignment
4,Nyx,Tiefling,Rogue,6,48,Chaotic Evil
5,Kael,Dragonborn,Paladin,7,56,Lawful Neutral


In [25]:
# Filter for Halfling Bards (using AND)
halfling_bards = characters[
    (characters['race'] == 'Halfling') &
    (characters['class'] == 'Bard')
]
halfling_bards

Unnamed: 0,name,race,class,level,hp,alignment
6,Mira,Halfling,Bard,3,30,Neutral Good


In [26]:
# Filter for magic-using classes (using OR)
magic_users = characters[
    (characters['class'] == 'Wizard') |
    (characters['class'] == 'Sorcerer') |
    (characters['class'] == 'Warlock')
]
magic_users

Unnamed: 0,name,race,class,level,hp,alignment
2,Glim,Gnome,Wizard,4,33,Neutral
8,Zara,Aasimar,Sorcerer,4,36,Chaotic Good
9,Fenwick,Goblin,Warlock,2,24,Lawful Evil


## 6. Modifying the DataFrame

### 1. Add a New Column
You can create a new column by assigning a list of values to a new column name using square brackets. The list must contain the same number of elements as there are rows in the `DataFrame`.
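
Besides a list, a new column can also be assigned from a single value or from a calculation on existing columns. A short sketch, where the `platform` and `downloads_billions` columns are invented for illustration:

```python
import pandas as pd

apps = pd.DataFrame({
    'app_name': ['YouTube', 'TikTok', 'Depop'],
    'downloads_millions': [5000, 3000, 200],
})

# A single value is repeated for every row
apps['platform'] = 'mobile'

# Arithmetic on a column is applied to each row
apps['downloads_billions'] = apps['downloads_millions'] / 1000

print(apps)
```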

### 2. Sort by Values
The `.sort_values()` method reorders the `DataFrame` based on the data in a specified column.
* By default, the sort is ascending (A-Z, 0-9).
    
* To sort in descending order (Z-A, 9-0), use the argument `ascending=False`.
    
* This operation returns a new, sorted `DataFrame`; the original is left unchanged. Missing values are placed at the end by default, whichever direction you sort in.
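
`.sort_values()` can also sort by several columns and control where missing values go. A minimal sketch on a shortened version of the `apps` data:

```python
import pandas as pd

apps = pd.DataFrame({
    'app_name': ['YouTube', 'TikTok', 'Spotify', 'Headspace'],
    'rating': [4.7, 4.6, 4.6, None],
    'downloads_millions': [5000, 3000, 2000, 500],
})

# Sort by rating (highest first), breaking ties by downloads (lowest first)
by_rating_then_downloads = apps.sort_values(
    ['rating', 'downloads_millions'],
    ascending=[False, True],
)

# Missing values go last by default; na_position='first' moves them to the top
missing_first = apps.sort_values('rating', na_position='first')

print(by_rating_then_downloads)
print(missing_first)
```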

### 3. Rename Columns
Use the `.rename()` method to change column names. You pass a dictionary to the `columns` parameter where the keys are the old names and the values are the new names.
    
* You can assign the result of the `.rename()` operation back to your original variable name. This effectively replaces the old `DataFrame` with the new, modified one.
    
* The `inplace=True` argument modifies the original `DataFrame` directly, rather than returning a new one.
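
The two styles can be compared side by side. This is a sketch on a two-row sample, and the new name `stars` is made up for the second rename:

```python
import pandas as pd

apps = pd.DataFrame({'app_name': ['YouTube', 'TikTok'], 'rating': [4.7, 4.6]})

# Default: a new DataFrame comes back and `apps` keeps its old column names
renamed = apps.rename(columns={'app_name': 'app'})
print(list(apps.columns))
print(list(renamed.columns))

# inplace=True: `apps` itself is changed
apps.rename(columns={'rating': 'stars'}, inplace=True)
print(list(apps.columns))
```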

In [27]:
import pandas as pd

# App data
app_data = {
  'app_name': ['YouTube', 'TikTok', 'Instagram', 'Spotify', 'Duolingo', 'Twitter', 'Headspace', 'Discord', 'Depop'],
  'category': ['Video', 'Social Media', 'Social Media', 'Music', 'Education', 'Social Media', 'Health', 'Communication', 'Shopping'],
  'rating': [4.7, 4.6, 4.5, 4.6, 4.7, 4.3, None, 4.7, 4.4],
  'downloads_millions': [5000, 3000, 3500, 2000, None, 1500, 500, 600, 200]
}

# Create the DataFrame
apps = pd.DataFrame(app_data)
apps

Unnamed: 0,app_name,category,rating,downloads_millions
0,YouTube,Video,4.7,5000.0
1,TikTok,Social Media,4.6,3000.0
2,Instagram,Social Media,4.5,3500.0
3,Spotify,Music,4.6,2000.0
4,Duolingo,Education,4.7,
5,Twitter,Social Media,4.3,1500.0
6,Headspace,Health,,500.0
7,Discord,Communication,4.7,600.0
8,Depop,Shopping,4.4,200.0


In [28]:
# Adding a new column
apps['downloaded'] = [False, False, False, False, False, True, False, True, False]
apps

Unnamed: 0,app_name,category,rating,downloads_millions,downloaded
0,YouTube,Video,4.7,5000.0,False
1,TikTok,Social Media,4.6,3000.0,False
2,Instagram,Social Media,4.5,3500.0,False
3,Spotify,Music,4.6,2000.0,False
4,Duolingo,Education,4.7,,False
5,Twitter,Social Media,4.3,1500.0,True
6,Headspace,Health,,500.0,False
7,Discord,Communication,4.7,600.0,True
8,Depop,Shopping,4.4,200.0,False


In [29]:
# Sorting by a column's values
apps_highest_rating = apps.sort_values('rating', ascending=False)
apps_highest_rating

Unnamed: 0,app_name,category,rating,downloads_millions,downloaded
0,YouTube,Video,4.7,5000.0,False
4,Duolingo,Education,4.7,,False
7,Discord,Communication,4.7,600.0,True
1,TikTok,Social Media,4.6,3000.0,False
3,Spotify,Music,4.6,2000.0,False
2,Instagram,Social Media,4.5,3500.0,False
8,Depop,Shopping,4.4,200.0,False
5,Twitter,Social Media,4.3,1500.0,True
6,Headspace,Health,,500.0,False


In [30]:
# Renaming a column
# .rename() returns a new DataFrame; assigning it back to `apps` replaces the old one
apps = apps.rename(columns={'app_name':'app'})
apps

Unnamed: 0,app,category,rating,downloads_millions,downloaded
0,YouTube,Video,4.7,5000.0,False
1,TikTok,Social Media,4.6,3000.0,False
2,Instagram,Social Media,4.5,3500.0,False
3,Spotify,Music,4.6,2000.0,False
4,Duolingo,Education,4.7,,False
5,Twitter,Social Media,4.3,1500.0,True
6,Headspace,Health,,500.0,False
7,Discord,Communication,4.7,600.0,True
8,Depop,Shopping,4.4,200.0,False


In [31]:
# Alternatively,
# 'inplace=True' modifies the original 'apps' DataFrame directly
# ('app_name' was already renamed in the previous cell, so this call changes nothing here)
apps.rename(columns={'app_name':'app'}, inplace=True)
apps

Unnamed: 0,app,category,rating,downloads_millions,downloaded
0,YouTube,Video,4.7,5000.0,False
1,TikTok,Social Media,4.6,3000.0,False
2,Instagram,Social Media,4.5,3500.0,False
3,Spotify,Music,4.6,2000.0,False
4,Duolingo,Education,4.7,,False
5,Twitter,Social Media,4.3,1500.0,True
6,Headspace,Health,,500.0,False
7,Discord,Communication,4.7,600.0,True
8,Depop,Shopping,4.4,200.0,False
