## Part 2: Basic Data Structures, Series, and Selections

### 1. Motivation: Without Pandas

In pure Python, a common way to represent tabular data is a dictionary:  
- Keys are column names.  
- Values are lists holding one entry per row (or nested dictionaries).

Example: build a small `people` collection with `first`, `last`, `email`.

In [None]:
# Plain dictionary: one key per column, one list of values per column
people = {
    "first": ["Alice", "Bob", "Carol"],
    "last": ["Smith", "Jones", "Lee"],
    "email": ["alice@example.com", "bob@example.com", "carol@example.com"],
}

This works, but lacks convenient vectorized operations, alignment, and metadata.
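
To make the limitation concrete, here is a minimal sketch of two everyday tasks done without pandas. The dictionary is repeated so the snippet runs on its own; `full_names` and `second_person` are illustrative names, not part of the original notebook.

```python
# Pure-Python table: one key per column, one list of values per column
people = {
    "first": ["Alice", "Bob", "Carol"],
    "last": ["Smith", "Jones", "Lee"],
    "email": ["alice@example.com", "bob@example.com", "carol@example.com"],
}

# No vectorized operations: combining two columns needs an explicit loop
full_names = [
    f"{first} {last}"
    for first, last in zip(people["first"], people["last"])
]

# No built-in notion of a row: one record means indexing every list by hand
second_person = {column: values[1] for column, values in people.items()}

print(full_names)
print(second_person)
```

Both tasks are doable, but every operation has to be written as a loop or comprehension over the lists.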

### 2. Converting Dict to DataFrame

In [None]:
import pandas as pd

df_small = pd.DataFrame(people)
df_small

* Columns correspond to keys.
* Each row collects the values found at the same position in every list, so all lists must have the same length.
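
The length requirement can be checked with a short sketch. The `uneven` dictionary below is a made-up example, and the exact error message may differ between pandas versions, so it is printed rather than assumed.

```python
import pandas as pd

# Hypothetical input: "last" has one value fewer than "first"
uneven = {
    "first": ["Alice", "Bob", "Carol"],
    "last": ["Smith", "Jones"],
}

try:
    pd.DataFrame(uneven)
    error_raised = False
except ValueError as exc:
    # pandas refuses to guess how to align lists of different lengths
    error_raised = True
    print(f"ValueError: {exc}")
```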

In [None]:
# Similar behaviour for numpy arrays
import numpy as np
data = np.array([[10, 2, 1993], [24, 8, 2006], [15, 5, 1810]])
df_array = pd.DataFrame(data, columns=['day', 'month', 'year'])
df_array
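
As a small aside on the array case: a NumPy array carries no column names, so they have to be supplied. The sketch below (with the illustrative name `df_unnamed`) shows what happens if `columns` is left out.

```python
import numpy as np
import pandas as pd

data = np.array([[10, 2, 1993], [24, 8, 2006], [15, 5, 1810]])

# Without `columns`, pandas falls back to integer column labels 0, 1, 2
df_unnamed = pd.DataFrame(data)
print(df_unnamed.columns.tolist())

# With `columns`, each array column gets a readable name
df_named = pd.DataFrame(data, columns=["day", "month", "year"])
print(df_named.columns.tolist())
```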

### 3. Series and Column Access

In [None]:
# Single column access returns a Series
df_small['email']
df_small.email  # shorthand, but can conflict

> **Caveat**: attribute access breaks if the column name clashes with an existing DataFrame method or attribute (e.g. `count`), or if it is not a valid Python identifier (spaces, punctuation, etc.). `df['email']` is always unambiguous; `df.email` is only syntactic sugar. ([pandas.pydata.org][6])
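
A minimal sketch of the clash, using a hypothetical frame `df_clash` whose column names were picked to trigger both problems:

```python
import pandas as pd

# Hypothetical frame: "count" shadows a DataFrame method,
# and "first name" is not a valid Python identifier
df_clash = pd.DataFrame({
    "count": [3, 1, 2],
    "first name": ["Alice", "Bob", "Carol"],
})

by_brackets = df_clash["count"]   # the column, as a Series
by_attribute = df_clash.count     # the DataFrame.count method, not the column

print(type(by_brackets).__name__)
print(callable(by_attribute))

# Only bracket access can reach a column whose name contains a space
first_names = df_clash["first name"].tolist()
print(first_names)
```

Bracket access costs a few extra characters but behaves the same for every column name.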

---

[6]: https://pandas.pydata.org/docs/dev/whatsnew/v2.3.0.html "What's new in 2.3.0 (June 4, 2025) - Pandas"

### 4. Selecting Multiple Columns and Inspecting Available Columns

In [None]:
# Suppose we want first + email only
df_small[['first', 'email']]

# List all columns in a DataFrame
df_small.columns
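
`df.columns` returns an `Index` object rather than a plain list. The sketch below shows a few common ways to work with it; the frame `df_cols` and the missing `phone` column are made up for illustration.

```python
import pandas as pd

df_cols = pd.DataFrame({
    "first": ["Alice"],
    "last": ["Smith"],
    "email": ["alice@example.com"],
})

column_names = df_cols.columns.tolist()   # plain Python list of names
has_email = "email" in df_cols.columns    # membership test on the Index

# Keep only the requested columns that exist, avoiding a KeyError for "phone"
wanted = ["first", "email", "phone"]
available = [name for name in wanted if name in df_cols.columns]
subset = df_cols[available]

print(column_names)
print(subset.columns.tolist())
```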

### 5. Indexing with `loc` and `iloc` (on the small df)

In [None]:
# .loc uses labels
df_small.loc[0]                             # first row by label
df_small.loc[0, 'email']                    # scalar
df_small.loc[[0, 1], ['first', 'email']]    # multiple rows and columns

# .iloc uses integer positions
df_small.iloc[0]         # first row
df_small.iloc[0, 2]      # first row, third column (email)
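
With the default index, labels and positions coincide, so `.loc[0]` and `.iloc[0]` look interchangeable. A sketch with a hypothetical non-default index (`10, 20, 30`) makes the difference visible:

```python
import pandas as pd

# Hypothetical frame whose row labels are not 0, 1, 2
df_labeled = pd.DataFrame(
    {
        "first": ["Alice", "Bob", "Carol"],
        "email": ["alice@example.com", "bob@example.com", "carol@example.com"],
    },
    index=[10, 20, 30],
)

by_label = df_labeled.loc[10, "first"]   # row labelled 10
by_position = df_labeled.iloc[0, 0]      # row at position 0, column at position 0

# There is no row labelled 0, so .loc[0] fails even though position 0 exists
try:
    df_labeled.loc[0]
    label_zero_exists = True
except KeyError:
    label_zero_exists = False

print(by_label, by_position, label_zero_exists)
```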

### 6. Return to the Big Survey Data

In [None]:
# Load the data from a CSV file
# Forward slashes work on every OS and avoid the invalid "\s" escape sequence
df = pd.read_csv('data/survey_results_public.csv')

In [None]:
# Re-check shape
df.shape

In [None]:
# Example: explore a column (e.g., "Employment")
# NOTE: column names are case-sensitive and must match exactly what the schema shows.
# If the column is "Employment", we can do:

df.loc[0]                                 # first respondent
df.loc[0, 'Employment']                   # their answer to Employment
df.loc[[0, 1, 2], 'Employment']           # first three respondents' Employment
df.loc[0:2, 'Employment']                 # slicing; inclusive of 2
df.loc[0:2, 'Employment':'EdLevel']       # column range selection, inclusive

> *`.loc[0:2]` is label slicing and **inclusive** of the end; this trips people coming from Python list slicing.* ([pandas.pydata.org][7])
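
The difference in slicing is easiest to see side by side. This sketch uses a tiny stand-in frame (`df_slice`, with placeholder values) rather than the survey data:

```python
import pandas as pd

# Stand-in frame with the default integer index 0..3
df_slice = pd.DataFrame({"Employment": ["a", "b", "c", "d"]})

loc_rows = df_slice.loc[0:2]     # label slice: includes label 2
iloc_rows = df_slice.iloc[0:2]   # position slice: stops before position 2
list_items = ["a", "b", "c", "d"][0:2]   # plain Python list slice, for comparison

print(len(loc_rows), len(iloc_rows), len(list_items))
```

`.iloc` follows the usual Python convention; only `.loc` includes the end label.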

---

[7]: https://pandas.pydata.org/docs/whatsnew/index.html "Release notes: pandas 2.3.1 documentation"

### Exercise for Part 2

In the small df:
- Modify the dictionary to add a column `uid` containing a unique identifier for each person, in the form **FLDDMMYY**:
    - **F**: first letter of the first name
    - **L**: first letter of the last name
    - **DD**: two-digit day of birth
    - **MM**: two-digit month of birth
    - **YY**: last two digits of the year of birth

(You can use the dates from the NumPy array in section 2.)

In the big survey DataFrame:
- Retrieve rows 10 through 15 and print their `Employment` through `EdLevel` columns (adjust the column names if necessary, using the schema).
- Print the last 10 answers from the last column.
- Print 5 answers from the exact middle of the DataFrame.
- (Optional) Count how many respondents answered "Yes" when asked if they currently use AI tools in their development process (`AISelect` column).