Notebook 1: 1.1 Exploring and Pre-processing a Dataset using Pandas

In [41]:
import numpy as np
import pandas as pd

In [49]:
# Import the NumPy library, which is fundamental for numerical operations in Python.
# It's often used for working with arrays and matrices.
import numpy as np

# Import the Pandas library, which is used for data manipulation and analysis.
# It provides data structures like DataFrame, which is excellent for handling tabular data.
import pandas as pd

# Load the dataset from an Excel file hosted online.
# 'df_can' will be a Pandas DataFrame containing the data.
df_can = pd.read_excel(
    'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork/Data%20Files/Canada.xlsx',
    sheet_name='Canada by Citizenship',  # Specifies which sheet in the Excel file to read.
    skiprows=range(20),                 # Skips the first 20 rows of the Excel sheet (header rows).
    skipfooter=2                        # Skips the last 2 rows of the Excel sheet (footer rows).
)
# Print a confirmation message to indicate that the data has been successfully loaded.
print('Data read into a pandas dataframe!')

Data read into a pandas dataframe!


In [50]:
# --- Initial cleaning and renaming as in the notebook ---

# Remove unnecessary columns from the DataFrame.
# 'AREA', 'REG', 'DEV', 'Type', 'Coverage' are columns that are not needed for the analysis.
# `axis=1` indicates that we are dropping columns (as opposed to rows).
# `inplace=True` modifies the DataFrame `df_can` directly.
df_can.drop(['AREA','REG','DEV','Type','Coverage'], axis=1, inplace=True)

# Rename some of the columns to be more descriptive and easier to use.
# 'OdName' is renamed to 'Country'.
# 'AreaName' is renamed to 'Continent'.
# 'RegName' is renamed to 'Region'.
# `inplace=True` modifies the DataFrame `df_can` directly.
df_can.rename(columns={'OdName':'Country', 'AreaName':'Continent', 'RegName':'Region'}, inplace=True)

# Set the 'Country' column as the index of the DataFrame.
# This makes it easier to look up data by country name.
# `inplace=True` modifies the DataFrame `df_can` directly.
df_can.set_index('Country', inplace=True) # Set index early for convenience

# Calculate the total number of immigrants for each country over all the years.
# At this point the only numeric columns are the year columns (1980-2013), so they are
# selected by dtype. This must run before a 'Total' column exists: selecting numeric
# columns afterwards would include 'Total' itself and count the years twice.
# The original notebook's `df_can['Total'] = df_can.sum(axis=1)` can fail or give
# unexpected results when string columns such as 'Continent' are part of the sum.
numeric_cols = df_can.select_dtypes(include=np.number).columns
# `.sum(axis=1)` adds the selected columns across each row (one total per country).
df_can['Total'] = df_can[numeric_cols].sum(axis=1)

# --- Note on column names (years) ---
# The year column names are kept as integers for now, which is fine for most pandas
# operations; they are converted to strings later, before plotting.

# The `years` variable as defined in the Matplotlib notebook is very useful.
# Let's define it here too, but keep original year columns as integers for now in df_can for easier sum/describe.

# Create a list of integers representing the years from 1980 to 2013 (exclusive of 2014).
# This list can be used for accessing year-specific data.
years_int = list(range(1980, 2014))

# Create a list of strings representing the years from 1980 to 2013.
# This is useful when year column names need to be strings (e.g., for some plotting libraries or specific indexing).
years_str = list(map(str, range(1980, 2014)))

# Print a message indicating that the initial setup and data loading are complete.
print("Initial setup and data loading complete. df_can is ready.")

# Display the first 2 rows of the cleaned and processed DataFrame.
# This is a quick way to inspect the DataFrame and verify the changes.
df_can.head(2)

Initial setup and data loading complete. df_can is ready.


Country,Continent,Region,DevName,1980,1981,1982,1983,1984,1985,1986,...,2005,2006,2007,2008,2009,2010,2011,2012,2013,Total
Afghanistan,Asia,Southern Asia,Developing regions,16,39,39,47,71,340,496,...,3436,3009,2652,2111,1746,1758,2203,2635,2004,58639
Albania,Europe,Southern Europe,Developed regions,1,0,0,0,0,0,1,...,1223,856,702,560,716,561,539,620,603,15699


---

### `df_can.head()` and `df_can.tail()`
**(Original content for `df_can.head()` and `df_can.tail()` is assumed here)**

**NEW EXAMPLE: Displaying a specific number of rows**
You can pass a number to `head()` or `tail()` to specify how many rows you want to see.

In [54]:
print("Displaying the first 3 rows:")
print(df_can.head(3))

Displaying the first 3 rows:
            Continent           Region             DevName  1980  1981  1982  \
Country                                                                        
Afghanistan      Asia    Southern Asia  Developing regions    16    39    39   
Albania        Europe  Southern Europe   Developed regions     1     0     0   
Algeria        Africa  Northern Africa  Developing regions    80    67    71   

             1983  1984  1985  1986  ...  2005  2006  2007  2008  2009  2010  \
Country                              ...                                       
Afghanistan    47    71   340   496  ...  3436  3009  2652  2111  1746  1758   
Albania         0     0     0     1  ...  1223   856   702   560   716   561   
Algeria        69    63    44    69  ...  3626  4807  3623  4005  5393  4752   

             2011  2012  2013   Total  
Country                                
Afghanistan  2203  2635  2004   58639  
Albania       539   620   603   15699  
Algeria  

In [55]:
print("\nDisplaying the last 2 rows:")
print(df_can.tail(2))


Displaying the last 2 rows:
         Continent          Region             DevName  1980  1981  1982  \
Country                                                                    
Zambia      Africa  Eastern Africa  Developing regions    11    17    11   
Zimbabwe    Africa  Eastern Africa  Developing regions    72   114   102   

          1983  1984  1985  1986  ...  2005  2006  2007  2008  2009  2010  \
Country                           ...                                       
Zambia       7    16     9    15  ...    91    77    71    64    60   102   
Zimbabwe    44    32    29    43  ...   615   454   663   611   508   494   

          2011  2012  2013  Total  
Country                            
Zambia      69    46    59   1677  
Zimbabwe   434   437   407   8598  

[2 rows x 38 columns]


**INTERACTIVE EXERCISE: Viewing Top/Bottom N Rows**

In [None]:
# Interactive input for head()
try:
    n_head = int(input("Enter the number of rows you want to see from the top (e.g., 5): "))
    print(f"\nDisplaying the first {n_head} rows:")
    print(df_can.head(n_head))
except ValueError:
    print("Invalid input. Please enter an integer.")

# Interactive input for tail()
try:
    n_tail = int(input("\nEnter the number of rows you want to see from the bottom (e.g., 3): "))
    print(f"\nDisplaying the last {n_tail} rows:")
    print(df_can.tail(n_tail))
except ValueError:
    print("Invalid input. Please enter an integer.")

**Explanation:**
The code above prompts you to enter a number. This number is then used with the `head()` or `tail()` methods to display the exact number of rows you requested from the beginning or end of the DataFrame, respectively. If you enter "5" for the top rows, `df_can.head(5)` shows the first five countries' data.
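
As a small, self-contained sketch of the same idea (using a made-up five-row frame rather than the immigration data), the snippet below shows `head(n)`, `tail(n)`, and the less obvious negative form, where `head(-n)` returns everything except the last `n` rows. The frame `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data; the values are for illustration only.
toy = pd.DataFrame(
    {"value": [10, 20, 30, 40, 50]},
    index=["a", "b", "c", "d", "e"],
)

first_two = toy.head(2)       # first two rows: a, b
last_two = toy.tail(2)        # last two rows: d, e
all_but_last = toy.head(-1)   # negative n: every row except the last one

print(first_two.index.tolist())
print(last_two.index.tolist())
print(all_but_last.index.tolist())
```

The negative form can be handy when a file ends with a summary row that should be left out of a quick preview.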

---

### `df_can.info()`
**(Original content for `df_can.info()` is assumed here)**

**NEW EXAMPLE: Using `memory_usage='deep'` for more accurate memory information**

In [None]:
print("Detailed info including more accurate memory usage:")
df_can.info(verbose=True, memory_usage='deep')

**Explanation:**
Using `memory_usage='deep'` gives a more accurate estimate of the memory used by the DataFrame, especially if it contains object-type columns (like strings), as it introspects the data to account for the actual memory consumed by the objects.
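
To see where the difference comes from, the sketch below compares `DataFrame.memory_usage()` with and without `deep=True` on a made-up frame. The frame and its values are assumptions; the string column is forced to `object` dtype so the effect is visible regardless of the pandas version in use.

```python
import pandas as pd

# Hypothetical toy frame: one object (string) column and one integer column.
toy = pd.DataFrame({
    "Country": pd.Series(["Afghanistan", "Albania", "Algeria"], dtype=object),
    "Total": [1, 2, 3],
})

shallow = toy.memory_usage(deep=False)  # object columns: only the references are counted
deep = toy.memory_usage(deep=True)      # object columns: the string objects are counted too

print("Country:", shallow["Country"], "vs", deep["Country"])
print("Total:  ", shallow["Total"], "vs", deep["Total"])
```

The numeric column reports the same size either way; only the object column grows under `deep=True`.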

---

### `df_can.columns` and `df_can.index`
**(Original content for `df_can.columns` and `df_can.index` and their types is assumed here)**

**NEW EXAMPLE: Accessing a specific column name or index label**

In [None]:
print("The first column name is:", df_can.columns[0])
print("The fifth column name is:", df_can.columns[4]) # Example with an integer year column

# Assuming 'Country' is the index as per df_can.set_index('Country', inplace=True)
print("\nThe first country in the index is:", df_can.index[0])
print("The tenth country in the index is:", df_can.index[9])

**INTERACTIVE EXERCISE: Checking for a column's existence**

In [None]:
col_to_check = input("Enter a column name to check if it exists in the DataFrame (e.g., 'Continent', 1995, 'Total', 'Population'): ")
if col_to_check.isdigit(): # Check if input can be an integer (for year columns)
    col_to_check = int(col_to_check)

if col_to_check in df_can.columns:
    print(f"Yes, the column '{col_to_check}' exists in the DataFrame.")
else:
    print(f"No, the column '{col_to_check}' does NOT exist in the DataFrame.")

**Explanation:**
This exercise takes a column name you provide and checks whether it is present in `df_can.columns`. The output tells you whether the column is part of the dataset, which helps avoid errors from accessing columns that do not exist or are misspelled. Note the conversion to `int` when the input consists only of digits, because our year columns are currently integers.
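
The same check can be wrapped in a small helper that does not depend on `input()`. This is a sketch only: `resolve_column` is a hypothetical name, and the toy frame stands in for `df_can`. It tries the text as typed first and then as an integer, so it works whether year columns are stored as `1995` or `'1995'`.

```python
import pandas as pd

def resolve_column(df, text):
    """Return the column label matching user-typed text, or None if absent.

    Hypothetical helper for illustration, not part of pandas.
    """
    if text in df.columns:
        return text
    if text.isdigit() and int(text) in df.columns:
        return int(text)
    return None

# Toy frame with a mix of string and integer column labels.
toy = pd.DataFrame({"Continent": ["Asia"], 1995: [100], "Total": [100]})

found_text = resolve_column(toy, "Continent")
found_year = resolve_column(toy, "1995")
missing = resolve_column(toy, "Population")
print(found_text, found_year, missing)
```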

---

### `df_can.shape`
**(Original content for `df_can.shape` is assumed here)**

**NEW EXAMPLE: Accessing rows and columns from shape individually**

In [None]:
num_rows = df_can.shape[0]
num_cols = df_can.shape[1]
print(f"The DataFrame has {num_rows} rows and {num_cols} columns.")
print(f"The total number of data points (cells) is {df_can.size}, which is {num_rows} * {num_cols}.")

---

### Cleaning Data: `drop()` and `rename()`
**(Original content for `drop()` and `rename()` is assumed here. Note: `df_can` was already modified by these in the setup)**

**NEW EXAMPLE: Renaming multiple columns and dropping one more (hypothetically)**
Let's make a copy to demonstrate without altering `df_can` further for subsequent original notebook cells.

In [None]:
df_temp = df_can.copy()
# Rename 'Region' to 'GeographicRegion' and the 1980 column to 'Year_1980'.
df_temp.rename(columns={'Region': 'GeographicRegion', 1980: 'Year_1980'}, inplace=True)

# Drop the newly renamed 'Year_1980' column to demonstrate drop().
# Any other column is dropped the same way, e.g.:
# df_temp.drop(['DevName'], axis=1, inplace=True)
if 'Year_1980' in df_temp.columns:
    df_temp.drop(['Year_1980'], axis=1, inplace=True)
    print("Dropped 'Year_1980' and renamed 'Region'. First 2 rows of temp df:")
    print(df_temp.head(2))
else:
    print("'Year_1980' not found to drop.")

del df_temp # Clean up

**INTERACTIVE EXERCISE: Renaming a column**

In [None]:
# Make a copy to avoid permanent changes to df_can for this exercise
df_exercise_rename = df_can.copy()

print("Current columns:", df_exercise_rename.columns.tolist())
col_to_rename = input("Enter a column name you want to rename (e.g., 'Continent'): ")

# Check if the column exists
if col_to_rename.isdigit(): # For year columns
    col_to_rename_actual = int(col_to_rename)
else:
    col_to_rename_actual = col_to_rename

if col_to_rename_actual in df_exercise_rename.columns:
    new_name = input(f"Enter the new name for '{col_to_rename}': ")
    df_exercise_rename.rename(columns={col_to_rename_actual: new_name}, inplace=True)
    print(f"\nColumn '{col_to_rename}' has been renamed to '{new_name}'.")
    print("Updated columns list:", df_exercise_rename.columns.tolist())
    print("\nFirst 2 rows with the new column name:")
    print(df_exercise_rename.head(2))
else:
    print(f"Column '{col_to_rename}' not found in the DataFrame.")

del df_exercise_rename # clean up

**Explanation:**
The code prompts you for a column you wish to rename and its new name. If the column exists (e.g., you type 'Continent' and then 'AreaOfOrigin'), it uses the `rename()` method to change the column's header. The output then shows the updated list of columns and the first few rows with the new column name. This is crucial for making datasets more readable or conforming to specific naming conventions.
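
Two details of `rename()` are easy to miss: without `inplace=True` it returns a new DataFrame, and by default it silently ignores labels that do not exist. The sketch below, on a made-up frame, shows both and uses the `errors="raise"` option to turn a misspelled label into a `KeyError`.

```python
import pandas as pd

toy = pd.DataFrame({"Continent": ["Asia"], "Region": ["Southern Asia"]})

# rename() returns a new DataFrame by default; `toy` itself is unchanged.
renamed = toy.rename(columns={"Continent": "AreaOfOrigin"})

# A misspelled label is silently ignored by default...
unchanged = toy.rename(columns={"Contnent": "AreaOfOrigin"})  # note the typo

# ...while errors="raise" reports the same typo as a KeyError.
try:
    toy.rename(columns={"Contnent": "AreaOfOrigin"}, errors="raise")
    typo_caught = False
except KeyError:
    typo_caught = True

print(renamed.columns.tolist(), unchanged.columns.tolist(), typo_caught)
```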

---

### Adding a 'Total' column
**(Original content for adding 'Total' column is assumed here. It was done in the setup.)**
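The original notebook has: `df_can['Total'] = df_can.sum(axis=1)`.
This sums across *all* columns of each row. Older pandas versions dropped non-numeric columns such as 'Continent' and 'Region' from the sum; recent versions (pandas 2.0 and later) no longer do, so the call can raise a `TypeError` on the string columns.
Our corrected setup:
`numeric_cols = df_can.select_dtypes(include=np.number).columns`
`df_can['Total'] = df_can[numeric_cols].sum(axis=1)`
This is more robust, provided `numeric_cols` is selected before the 'Total' column exists; otherwise 'Total' is itself numeric and gets added into the new sum. The original notebook later converts the year column *names* to strings. That changes only the labels, not the data, so the columns stay numeric and can still be summed; what changes is that they must then be selected as `'1980'` rather than `1980`.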
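
The result of the `select_dtypes` approach depends on when it runs. A minimal sketch with a made-up two-country frame (the values are assumptions, not dataset figures) shows the correct total and the double count that appears if the numeric columns are selected again after 'Total' has been added.

```python
import numpy as np
import pandas as pd

# Hypothetical two-country, three-year frame; values are illustrative only.
toy = pd.DataFrame(
    {"Continent": ["Asia", "Europe"], 1980: [1, 10], 1981: [2, 20], 1982: [3, 30]},
    index=["A", "B"],
)

# Correct: pick the numeric columns while 'Total' does not exist yet.
numeric_cols = toy.select_dtypes(include=np.number).columns
toy["Total"] = toy[numeric_cols].sum(axis=1)

# Pitfall: selecting numeric columns again now also picks up 'Total'.
numeric_again = toy.select_dtypes(include=np.number).columns
double_counted = toy[numeric_again].sum(axis=1)

print(toy["Total"].tolist())
print(double_counted.tolist())
```

Selecting the year columns by an explicit list (such as `years_int`) avoids the problem entirely, at the cost of hard-coding the year range.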

**NEW EXAMPLE: Calculating total immigration for a specific decade**

In [None]:
# Example: Calculate total immigration for the 1980s
years_1980s = [year for year in years_int if 1980 <= year <= 1989]
df_can['Total_1980s'] = df_can[years_1980s].sum(axis=1)
print("DataFrame with 'Total_1980s' column (first 2 rows):")
print(df_can[['Total', 'Total_1980s']].head(2))
# df_can.drop('Total_1980s', axis=1, inplace=True) # Optional: clean up the new column

---
### `df_can.isnull().sum()` and `df_can.describe()`
**(Original content for these is assumed here)**

**NEW EXAMPLE: `describe()` for object columns**

In [None]:
print("Descriptive statistics for 'object' (e.g., string) type columns:")
print(df_can.describe(include='object'))

print("\nDescriptive statistics for all columns:")
print(df_can.describe(include='all'))
# Note: With include='all', numeric columns get numeric statistics (mean, std, quartiles, ...)
# and object columns get count/unique/top/freq; statistics that do not apply to a column are NaN.

---
### Indexing and Selection (Slicing)

**(Original content for selecting columns `df.column_name`, `df['column']`, `df[['col1', 'col2']]` is assumed)**

**NEW EXAMPLE: Selecting columns using a list variable**

In [None]:
selected_years = [1980, 1985, 1990, 1995, 2000, 2005, 2010] # Integer years
# If year columns were strings: selected_years = ['1980', '1985', ..., '2010']
print(f"Data for years: {selected_years}")
print(df_can[selected_years].head(3))

**INTERACTIVE EXERCISE: Selecting and Viewing a Specific Column**

In [None]:
print("Available columns:", df_can.columns.tolist())
col_to_view = input("Enter the name of a single column you want to view (e.g., 'Continent', 1990, 'Total'): ")

# Convert to int if it's a year column name that's an integer
actual_col_name = int(col_to_view) if col_to_view.isdigit() and int(col_to_view) in years_int else col_to_view

if actual_col_name in df_can.columns:
    print(f"\nDisplaying the '{actual_col_name}' column (first 5 entries):")
    print(df_can[actual_col_name].head())
    print(f"\nType of this column: {type(df_can[actual_col_name])}")
else:
    print(f"Column '{actual_col_name}' not found.")

**Explanation:**
You're asked to enter a column name. The code then displays the first 5 entries of that column and the type of the returned object, which is a Pandas `Series` when you select a single column (the `dtype` shown at the bottom of the printed entries is the data type of its values). For instance, if you enter 'Total', you'll see the total immigration figures for the first few countries.
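
The distinction between a `Series` and a one-column `DataFrame` comes down to the brackets used. The sketch below uses a made-up frame to show both forms side by side.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Continent": ["Asia", "Europe"], "Total": [5, 7]},
    index=["A", "B"],
)

as_series = toy["Total"]     # single label    -> Series
as_frame = toy[["Total"]]    # list of labels  -> DataFrame with one column

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

A `Series` has a one-element shape such as `(2,)`, while the one-column `DataFrame` keeps two dimensions, `(2, 1)`.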

---
### Select Row: `df.loc[label]` and `df.iloc[index]`
**(Original content including `set_index('Country', inplace=True)` is assumed. This was done in setup.)**

**NEW EXAMPLE: Using `.loc` with a list of countries and a slice of years**

In [None]:
# Ensure year columns are addressable, assuming they are integers:
# If years were strings: df_can.loc[['India', 'China', 'Pakistan'], '1980':'1985']
print(df_can.loc[['India', 'China', 'Pakistan'], 1980:1985]) # Integer column names for years

# Using .iloc for a range of rows and specific column positions.
# Rows at positions 3 and 4 (the slice 3:5 excludes 5), and the columns 'Continent', 'Region', and 1980.
# Column positions depend on the current DataFrame structure. With the columns
# ['Continent', 'Region', 'DevName', 1980, 1981, ..., 2013, 'Total'],
# Continent is at position 0, Region at 1, and 1980 at 3.
# Look the positions up instead of hard-coding them:
col_pos_continent = df_can.columns.get_loc('Continent')
col_pos_region = df_can.columns.get_loc('Region')
col_pos_1980 = df_can.columns.get_loc(1980) # year 1980
print("\nUsing iloc for rows 3-4 and columns for Continent, Region, and 1980:")
print(df_can.iloc[3:5, [col_pos_continent, col_pos_region, col_pos_1980]])

**INTERACTIVE EXERCISE: Fetching data for a specific country and year**

In [None]:
# df_can index is 'Country'
print("Some available countries:", df_can.index.tolist()[0:10]) # Show first 10 countries
country_input = input("Enter a country name to view its data (e.g., 'Japan'): ")

print("\nAvailable year columns (sample):", [c for c in df_can.columns if isinstance(c, int)][:5]) # Show some year columns
year_input_str = input("Enter a specific year (e.g., 1985 or 2012): ")

if country_input in df_can.index:
    try:
        year_input = int(year_input_str)
        if year_input in df_can.columns:
            immigrants = df_can.loc[country_input, year_input]
            print(f"\nNumber of immigrants from {country_input} in {year_input}: {immigrants}")
        else:
            print(f"The year {year_input} is not a valid column in the dataset.")
    except ValueError:
        print(f"Invalid year '{year_input_str}'. Please enter a numeric year.")
else:
    print(f"Country '{country_input}' not found in the index.")

**Explanation:**
This code asks for a country and a year. It then uses `df_can.loc[country, year]` to find the specific data point. For example, if you input 'Japan' and '1985', it will retrieve and display the number of immigrants from Japan in the year 1985. This demonstrates precise data retrieval using labels for both rows and columns.
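
One difference between the two indexers is worth seeing on a small example: label slices with `.loc` include both endpoints, while positional slices with `.iloc` exclude the end. The frame below is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {1980: [1, 2, 3], 1981: [4, 5, 6], 1982: [7, 8, 9]},
    index=["A", "B", "C"],
)

scalar = toy.loc["B", 1981]               # one cell, by row label and column label
by_label = toy.loc["A":"B", 1980:1981]    # label slices include both endpoints
by_position = toy.iloc[0:2, 0:2]          # positional slices exclude the end

print(scalar)
print(by_label)
print(by_position)
```

Both slices select the same four cells here, even though the `.iloc` slice ends at position 2 and the `.loc` slice ends at the label `1981`.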

---
### Converting column names to string (Years)
**(Original notebook content: `df_can.columns = list(map(str, df_can.columns))` and defining `years = list(map(str, range(1980, 2014)))`)**
This is a crucial step for consistency, especially when plotting or when column names might be ambiguous if numeric.
Let's apply this transformation now to `df_can` so subsequent examples align with the notebook's state where years are strings.

In [None]:
# Convert all column names to strings. Only the labels change; the data in each
# column (including the numeric 'Total') keeps its dtype, so nothing needs re-attaching.
df_can.columns = list(map(str, df_can.columns))
years = list(map(str, range(1980, 2014))) # Same values as the 'years_str' list defined earlier

print("Year column names are now strings.")
print("First 5 columns:", df_can.columns.tolist()[:5])
print("Data for Haiti, year '1980':", df_can.loc['Haiti', '1980'])
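
The conversion above touches only the column labels. As a hedged sketch on a made-up one-row frame (not the real data), the snippet below checks that the dtypes are identical before and after, and that the values can still be summed once the labels are strings.

```python
import pandas as pd

toy = pd.DataFrame({"Continent": ["Asia"], 1980: [16], 1981: [39]}, index=["A"])
dtypes_before = toy.dtypes.tolist()

# Only the labels change; the values under them keep their dtypes.
toy.columns = list(map(str, toy.columns))
dtypes_after = toy.dtypes.tolist()

years_toy = list(map(str, range(1980, 1982)))  # ['1980', '1981']
row_sum = toy.loc["A", years_toy].sum()

print(toy.columns.tolist())
print(row_sum)
```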

---
### Filtering based on a criteria
**(Original content on boolean series and filtering `df_can[df_can['Continent'] == 'Asia']` is assumed)**

**NEW EXAMPLE: Filtering with multiple conditions and `.isin()`**

In [None]:
# Filter for countries in 'Asia' or 'Europe' that had more than 10000 immigrants in '2013'
condition_continents = df_can['Continent'].isin(['Asia', 'Europe'])
# Ensure '2013' column is numeric for comparison, if it's not, convert it.
# Assuming '2013' is already numeric or can be converted:
df_can['2013'] = pd.to_numeric(df_can['2013'])
condition_immigration = df_can['2013'] > 10000

filtered_df = df_can[condition_continents & condition_immigration]
print("Countries in Asia or Europe with >10000 immigrants in 2013:")
print(filtered_df[['Continent', '2013', 'Total']].sort_values(by='2013', ascending=False))

**INTERACTIVE EXERCISE: Filtering by Region and Total Immigration**

In [None]:
print("Available regions (sample):", df_can['Region'].unique()[:5])
region_input = input("Enter a region to filter by (e.g., 'Southern Asia'): ")

try:
    min_total_immigrants = int(input("Enter the minimum total number of immigrants for that region (e.g., 50000): "))

    condition_region = df_can['Region'] == region_input
    # Ensure 'Total' column is numeric for comparison
    df_can['Total'] = pd.to_numeric(df_can['Total'])
    condition_total = df_can['Total'] > min_total_immigrants

    result_df = df_can[condition_region & condition_total]

    if not result_df.empty:
        print(f"\nCountries in '{region_input}' with more than {min_total_immigrants} total immigrants:")
        print(result_df[['Continent', 'Region', 'Total']].sort_values(by='Total', ascending=False))
    else:
        print(f"No countries found in '{region_input}' with more than {min_total_immigrants} total immigrants.")

except ValueError:
    print("Invalid input for minimum total immigrants. Please enter an integer.")
except Exception as e:
    print(f"An error occurred: {e}")

**Explanation:**
You provide a region (e.g., 'Southern Asia') and a minimum number for total immigration (e.g., 50000). The code filters the DataFrame to show only countries matching that region AND exceeding your specified total immigration. The results are then displayed, sorted by the total immigration in descending order. This demonstrates how to combine multiple criteria for more specific data extraction.
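
A non-interactive sketch of the same filter, on a made-up frame, is shown below. It highlights the parentheses each comparison needs when combined with `&`, and gives the equivalent `DataFrame.query()` form; the frame, its values, and the variable `min_total` are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Region": ["Southern Asia", "Southern Asia", "Southern Europe"],
        "Total": [60000, 400, 15000],
    },
    index=["A", "B", "C"],
)

# Each comparison needs its own parentheses, because & binds more tightly than == and >.
mask = (toy["Region"] == "Southern Asia") & (toy["Total"] > 50000)
result = toy[mask]

# The same filter written with query(); @name refers to a Python variable.
min_total = 50000
result_q = toy.query("Region == 'Southern Asia' and Total > @min_total")

print(result)
```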

---
### Sorting Values: `sort_values()`
**(Original content on `sort_values()` is assumed)**

**NEW EXAMPLE: Sorting by multiple columns**

In [None]:
# Sort by 'Continent' (ascending) and then by 'Total' (descending) within each continent
df_sorted_multiple = df_can.sort_values(by=['Continent', 'Total'], ascending=[True, False])
print("DataFrame sorted by Continent (A-Z) then by Total immigration (High-Low):")
print(df_sorted_multiple[['Continent', 'Total']].head(10))

**INTERACTIVE EXERCISE: Sorting by a user-chosen column**

In [None]:
# Work on a copy so the numeric conversion below does not modify df_can
df_exercise_sort = df_can.copy()

print("Columns you can sort by (numeric ones are usually more meaningful for sorting values):")
# Show some numeric and some object type columns
numeric_sortable_cols = df_exercise_sort.select_dtypes(include=np.number).columns.tolist()
object_sortable_cols = ['Continent', 'Region'] # Index 'Country' is also sortable
print(f"Numeric: {numeric_sortable_cols}")
print(f"Categorical/Text: {object_sortable_cols}")

sort_column_input = input("Enter the column name to sort by (e.g., 'Total', 'Continent', '1992'): ")

# Check if column exists and handle type (numeric years are now strings)
if sort_column_input not in df_exercise_sort.columns:
    print(f"Column '{sort_column_input}' not found.")
else:
    order_input = input("Sort in ascending order? (yes/no, default no): ").strip().lower()
    ascending_order = order_input == 'yes'

    # If sorting by a year column or 'Total', ensure it's numeric for proper sorting
    if sort_column_input in years or sort_column_input == 'Total':
        df_exercise_sort[sort_column_input] = pd.to_numeric(df_exercise_sort[sort_column_input])

    df_sorted = df_exercise_sort.sort_values(by=sort_column_input, ascending=ascending_order)
    print(f"\nTop 5 rows of the DataFrame sorted by '{sort_column_input}' in {'ascending' if ascending_order else 'descending'} order:")
    # Avoid listing the sort column twice when it is already one of the displayed columns
    display_cols = list(dict.fromkeys(['Continent', 'Region', sort_column_input, 'Total']))
    print(df_sorted[display_cols].head())

del df_exercise_sort # Clean up

**Explanation:**
You choose a column to sort by (e.g., '1992' or 'Continent') and whether you want it in ascending (A-Z, smallest to largest) or descending order. The code then applies `sort_values()` based on your choices and displays the top 5 rows of the sorted data, showing relevant columns like 'Continent', 'Region', your chosen sort column, and 'Total'. This helps in quickly finding extremes or ordering data as needed for analysis.
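
For reference, here is a compact sketch of multi-key sorting on a made-up frame, together with `nlargest()`, which is a shortcut when only the top few rows by one column are needed. The data is illustrative only.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Continent": ["Asia", "Africa", "Asia", "Africa"], "Total": [5, 9, 7, 1]},
    index=["A", "B", "C", "D"],
)

# Continent A-Z, then Total high-to-low within each continent.
ordered = toy.sort_values(by=["Continent", "Total"], ascending=[True, False])

# nlargest() keeps the n rows with the largest values in the given column.
top_two = toy.nlargest(2, "Total")

print(ordered.index.tolist())
print(top_two.index.tolist())
```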

---
This completes the enhancements for the Pandas notebook. Next, the Matplotlib notebook.
The key is that `df_can` (with string year columns and 'Country' as index) and the `years` list (of string years) are correctly set up for Matplotlib.

In [None]:
# Final state of df_can for Matplotlib notebook:
# Index: Country
# Columns: 'Continent', 'Region', 'DevName', '1980', '1981', ..., '2013' (as strings), 'Total' (numeric)
# The 'Total_1980s' column was temporary for an example, let's ensure it's not there.
if 'Total_1980s' in df_can.columns:
    df_can.drop('Total_1980s', axis=1, inplace=True)

# Ensure all year columns in the 'years' list are string type and exist.
# And that their data in df_can is numeric for plotting.
for year_col in years:
    if year_col in df_can.columns:
        df_can[year_col] = pd.to_numeric(df_can[year_col], errors='coerce')
    else:
        print(f"Warning: Year column {year_col} not found in df_can for numeric conversion.")

# Ensure 'Total' column is numeric.
if 'Total' in df_can.columns:
    df_can['Total'] = pd.to_numeric(df_can['Total'], errors='coerce')

print("\ndf_can is now prepped for the Matplotlib notebook.")
print("Sample of year columns (should be strings):", years[:3])
print("Data types of some year columns in df_can:")
print(df_can[years[:3]].dtypes)
df_can.head(2)

---
## Enhanced Notebook 2: 1.2 Introduction to Matplotlib and Line Plots

I'll assume `df_can` is prepared as at the end of the enhanced Pandas section (Country as index, year columns as strings containing numeric data, `years` list of string years).

**(Initial setup cells for Matplotlib from your .ipynb file are assumed: `%matplotlib inline`, `import matplotlib as mpl`, `import matplotlib.pyplot as plt`, style settings.)**

In [None]:
# Ensure matplotlib is imported, and styles are set as in the notebook
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
# print(plt.style.available) # Show available styles
mpl.style.use(['ggplot']) # optional: for ggplot-like style

# df_can and years should be available from the previous (Pandas) notebook's execution.
# If not, they would need to be re-loaded and pre-processed here.
# For this exercise, we assume df_can is ready:
# - Index is 'Country'
# - Columns '1980' through '2013' are strings, but their data is numeric.
# - 'years' is a list of strings: ['1980', '1981', ..., '2013']

# Verify df_can and years (if running this notebook standalone, you'd need to load and prep df_can)
if 'df_can' not in globals() or 'years' not in globals():
    print("df_can or years not defined. Please run the Pandas preprocessing notebook first or load data here.")
    # Minimal df_can setup for Matplotlib examples if needed:
    # df_can = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork/Data%20Files/Canada.csv')
    # df_can.set_index('Country', inplace=True)
    # df_can.index.name = None
    # years = list(map(str, range(1980, 2014)))
    # # Ensure year columns are numeric for plotting if they were read as objects from CSV
    # for year_col_str in years:
    #   if year_col_str in df_can.columns:
    #       df_can[year_col_str] = pd.to_numeric(df_can[year_col_str], errors='coerce')
else:
    print("df_can and years are available.")

# The Haiti example uses .loc and expects string years in `years` list.
# Original notebook converts haiti.index to int for plotting. This is a good practice.

---
### Plotting a line graph for Haiti
**(Original content for plotting Haiti's immigration is assumed here)**
`haiti = df_can.loc['Haiti', years]`
`haiti.plot()`
`plt.title(...)`, `plt.ylabel(...)`, `plt.xlabel(...)`, `plt.show()`

**NEW EXAMPLE: Plotting for a different country with style adjustments**

In [None]:
# Let's plot for 'Philippines' with a different line style and color
if 'Philippines' in df_can.index:
    philippines = df_can.loc['Philippines', years]
    philippines.index = philippines.index.map(int) # Convert index to int for plotting

    philippines.plot(kind='line', color='green', linestyle='--', marker='o', markersize=4)

    plt.title('Immigration from Philippines')
    plt.ylabel('Number of Immigrants')
    plt.xlabel('Years')
    plt.grid(True) # Add a grid
    plt.legend(['Philippines']) # Add a legend
    plt.show()
else:
    print("Philippines not found in dataset.")

**Explanation:**
This example plots the immigration trend for the Philippines.
* `color='green'` sets the line color.
* `linestyle='--'` makes the line dashed.
* `marker='o'` adds a small circle at each data point.
* `markersize=4` controls the size of these markers.
* `plt.grid(True)` adds a grid to the background for better readability.
* `plt.legend()` explicitly adds a legend, which is good practice.
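
The same styling options can be tried without the dataset. The sketch below uses Matplotlib's object-oriented interface on made-up yearly counts; the series values and the label "Example country" are assumptions. The `Agg` backend is selected only so the snippet runs without a display, and that line should be dropped inside a notebook that uses `%matplotlib inline`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical yearly counts; not taken from the dataset.
series = pd.Series([120, 150, 90, 200], index=[1980, 1981, 1982, 1983])

fig, ax = plt.subplots()
ax.plot(series.index, series.values, color="green", linestyle="--",
        marker="o", markersize=4, label="Example country")
ax.set_title("Immigration from an example country")
ax.set_xlabel("Years")
ax.set_ylabel("Number of Immigrants")
ax.grid(True)   # background grid
ax.legend()     # uses the label passed to plot()

line = ax.lines[0]
legend = ax.get_legend()
plt.close(fig)
```

Passing `label=` to the plotting call and then calling `legend()` without arguments keeps the legend text next to the data it describes.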

---
### Annotating the plot (Haiti Earthquake)
**(Original content for `haiti.index = haiti.index.map(int)` and `plt.text(2000, 6000, '2010 Earthquake')` is assumed)**

**INTERACTIVE EXERCISE: Plotting a country and annotating a specific year**

In [None]:
# Ensure 'years' contains string representations of years, e.g., '1980', '1981', ...
# Ensure df_can contains numeric data for these year columns.

print("Available countries (sample):", df_can.index.tolist()[:10])
country_to_plot = input("Enter country name for plotting (e.g., 'India'): ")

if country_to_plot in df_can.index:
    country_series = df_can.loc[country_to_plot, years].copy() # Use .copy()
    country_series.index = country_series.index.map(int) # Convert string years to int for plotting X-axis

    country_series.plot(kind='line')
    plt.title(f'Immigration from {country_to_plot}')
    plt.ylabel('Number of Immigrants')
    plt.xlabel('Years')

    annotate_year_str = input(f"Enter a year to annotate on {country_to_plot}'s plot (e.g., 1990, 2005): ")
    annotation_text = input("Enter the text for annotation (e.g., 'Significant Event'): ")

    try:
        annotate_year_int = int(annotate_year_str)
        if annotate_year_int in country_series.index:
            y_value_at_year = country_series.loc[annotate_year_int]
            # Place text slightly above the point
            plt.text(annotate_year_int, y_value_at_year + 500, annotation_text, horizontalalignment='center')
            plt.axvline(x=annotate_year_int, color='red', linestyle=':', linewidth=0.8) # Add a vertical line
            print(f"Annotation added at year {annotate_year_int} with y-value near {y_value_at_year}.")
        else:
            print(f"Year {annotate_year_int} not in the data range for {country_to_plot} or data missing.")
    except ValueError:
        print("Invalid year for annotation. Please enter a numeric year.")
    except KeyError:
        print(f"Data for year {annotate_year_int} not found for {country_to_plot}.")

    plt.show()
else:
    print(f"Country '{country_to_plot}' not found.")

**Explanation:**
1.  You select a country to plot. Its immigration trend is displayed.
2.  You then provide a year and an annotation text for that country's plot.
3.  The code attempts to find the immigration value (`y_value_at_year`) for the country in the specified `annotate_year_int`.
4.  `plt.text(annotate_year_int, y_value_at_year + 500, annotation_text, ...)` places your text on the plot. The `+ 500` offset is in data units (number of immigrants), so the text sits just above the point for high-volume countries but may land well above the line for low-volume ones. A vertical dotted red line is also added at the specified year using `plt.axvline()`.
5.  If you chose 'India', year '1990', and text 'Economic Reforms', the plot would show India's immigration trend with 'Economic Reforms' annotated around 1990.
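
A fixed offset of 500 is measured in data units, so its visual size depends on the scale of the y-axis. As an alternative sketch (not the notebook's method), `Axes.annotate` can place the label a fixed number of points above the data point regardless of scale. The `series` below is hypothetical, and the `Agg` backend is assumed so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical low-volume series; a +500 data-unit offset would be far off here.
series = pd.Series([40, 55, 300, 120], index=[2008, 2009, 2010, 2011])
year = 2010

ax = series.plot(kind="line")

# xy is the data point; xytext is an offset of 10 points straight up from it.
note = ax.annotate(
    "Significant Event",
    xy=(year, series.loc[year]),
    xytext=(0, 10),
    textcoords="offset points",
    ha="center",
)
ax.axvline(x=year, color="red", linestyle=":", linewidth=0.8)
plt.close(ax.figure)
```

Because the offset is given in points rather than data units, the label keeps the same distance from the marker whether the country's values are in the tens or the tens of thousands.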

---
### Comparing Immigration from India and China
**(Original content: `df_CI = df_can.loc[['India', 'China'], years]`, then `df_CI.plot(kind='line')`, which gives a misleading plot because the countries end up on the x-axis and each year becomes its own line, then `df_CI = df_CI.transpose()`, `df_CI.index = df_CI.index.map(int)`, and plotting again, is assumed)**

**NEW EXAMPLE: Comparing three countries with customized plot**

In [None]:
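# The dataset lists the UK under its full name; the short label 'United Kingdom' would raise a KeyError.
requested = ['Pakistan', 'Philippines', 'United Kingdom of Great Britain and Northern Ireland']
# Keep only labels present in the index so a misspelled name is skipped instead of raising.
countries_to_compare = [c for c in requested if c in df_can.index]
df_compare_3 = df_can.loc[countries_to_compare, years].transpose()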

# Convert index (years) to int for plotting
df_compare_3.index = df_compare_3.index.map(int)

# Plot
ax = df_compare_3.plot(kind='line', figsize=(12, 7)) # ax allows further customization

plt.title(f'Immigration Comparison: {", ".join(countries_to_compare)}')
plt.ylabel('Number of Immigrants')
plt.xlabel('Years')
plt.legend(title='Country')
ax.set_facecolor('lightgray') # Example: change background color of plot area
plt.show()

**Explanation:**
This example compares immigration from Pakistan, the Philippines, and the UK.
* `df_can.loc[countries_to_compare, years]` selects data for these countries across all specified years.
* `.transpose()` is crucial: it makes years the index (for the x-axis) and countries the columns (so each country gets its own line).
* `figsize=(12, 7)` sets the figure size to 12 inches wide by 7 inches tall.
* `ax = ...plot(...)` captures the plot's "Axes" object, allowing for finer control like `ax.set_facecolor()`.
* `plt.legend(title='Country')` adds a title to the legend box.
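
The effect of `.transpose()` is easiest to see on a tiny frame. This is a minimal sketch with hypothetical numbers. The names `toy`, `wide`, and `by_year` are illustrative, and the year columns are strings, as in the notebook.

```python
import pandas as pd

# Hypothetical data: rows are countries, columns are years stored as strings.
toy = pd.DataFrame(
    {"1980": [10, 20], "1981": [12, 25], "1982": [15, 22]},
    index=["Country A", "Country B"],
)
years = ["1980", "1981", "1982"]

wide = toy.loc[["Country A", "Country B"], years]  # rows = countries, columns = years
by_year = wide.transpose()                          # rows = years, columns = countries
by_year.index = by_year.index.map(int)              # '1980' -> 1980 for a numeric x-axis

print(wide.shape, by_year.shape)
print(list(by_year.columns), list(by_year.index))
```

After the transpose, calling `by_year.plot(kind='line')` would draw one line per column, that is, one line per country.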

---
### Plotting Top 5 Countries
**(Original content: `df_can.sort_values(by='Total', ...)` , `df_top5 = df_can.head(5)`, `df_top5 = df_top5[years].transpose()`, plotting, is assumed)**

**INTERACTIVE EXERCISE: Plotting Top N countries chosen by the user**

In [None]:
try:
    num_countries = int(input("Enter the number of top countries (by total immigration) you want to plot (e.g., 3 or 7): "))
    if num_countries <= 0 or num_countries > 20: # Add a reasonable limit
        print("Please enter a number between 1 and 20.")
    else:
        # Sort by 'Total' (ensure 'Total' is numeric)
        df_can['Total'] = pd.to_numeric(df_can['Total'])
        df_can_sorted = df_can.sort_values(by='Total', ascending=False, axis=0)

        df_top_n = df_can_sorted.head(num_countries)

        # Transpose for plotting: years on x-axis, countries as separate lines
        df_top_n_plot = df_top_n[years].transpose()
        df_top_n_plot.index = df_top_n_plot.index.map(int) # Convert years index to int for plotting

        df_top_n_plot.plot(kind='line', figsize=(15, 8))

        plt.title(f'Immigration Trend of Top {num_countries} Countries')
        plt.ylabel('Number of Immigrants')
        plt.xlabel('Years')
        plt.legend(title='Country', bbox_to_anchor=(1.05, 1), loc='upper left') # Adjust legend position
        plt.tight_layout(rect=[0, 0, 0.85, 1]) # Adjust layout to make space for legend outside
        plt.show()

except ValueError:
    print("Invalid input. Please enter an integer.")

**Explanation:**
1.  You enter a number `N` (e.g., 3).
2.  The code sorts all countries by their 'Total' immigration in descending order.
3.  It then selects the top `N` countries using `head(num_countries)`.
4.  This `df_top_n` DataFrame is then prepared for plotting:
    * Only the yearly immigration data (`years` columns) is selected.
    * It's transposed (`.transpose()`) so that years become the index and each country becomes a column (resulting in a line for each country).
    * The year index is converted to integers for correct plotting on the x-axis.
5.  `df_top_n_plot.plot()` generates the line plot with each of the top `N` countries as a separate line.
6.  The legend is positioned outside the plot using `bbox_to_anchor` and `loc` for better readability if there are many lines. `plt.tight_layout(rect=[0, 0, 0.85, 1])` keeps the axes within the left 85% of the figure, leaving room for the legend on the right.

If you entered 3, you'd see the immigration trends for the three countries that contributed the most immigrants to Canada overall.
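
Steps 2 and 3 can also be combined with `DataFrame.nlargest`, which returns the `n` rows with the largest values in a column. This is an alternative sketch on hypothetical data (the `toy` frame and its values are made up), not the approach used in the cell above.

```python
import pandas as pd

# Hypothetical data: four countries, two string-named year columns.
toy = pd.DataFrame(
    {"1980": [5, 50, 20, 1], "1981": [7, 60, 25, 2]},
    index=["A", "B", "C", "D"],
)
years = ["1980", "1981"]
toy["Total"] = toy[years].sum(axis=1)

# Top 2 rows by 'Total', already ordered from largest to smallest.
top_n = toy.nlargest(2, "Total")

# Same preparation as in the exercise: years as an integer index, countries as columns.
top_n_plot = top_n[years].transpose()
top_n_plot.index = top_n_plot.index.map(int)
print(top_n_plot)
```

The result has the same shape as `df_top_n_plot` in the exercise, so it can be passed to `.plot(kind='line')` in the same way.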

---
### Other Plots (Brief Mention)
The notebook lists other plot types: `bar`, `barh`, `hist`, `box`, `kde`, `area`, `pie`, `scatter`, `hexbin`.

**NEW EXAMPLE: Simple Bar Plot for Total Immigration of a few countries**

In [None]:
# Let's select a few countries and plot their 'Total' immigration as a bar chart
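# The dataset lists the UK under its full name; the short label 'United Kingdom' is not in the index.
countries_for_bar = ['India', 'China', 'Philippines', 'Pakistan',
                     'United Kingdom of Great Britain and Northern Ireland']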
# Ensure 'Total' is numeric
df_can['Total'] = pd.to_numeric(df_can['Total'])

# Filter for these countries and their Total immigration
totals_subset = df_can.loc[countries_for_bar, 'Total']

totals_subset.plot(kind='bar', figsize=(10, 6), color=['skyblue', 'salmon', 'lightgreen', 'gold', 'orchid'])

plt.title('Total Immigration (1980-2013) for Selected Countries')
plt.xlabel('Country')
plt.ylabel('Total Number of Immigrants')
plt.xticks(rotation=45, ha='right') # Rotate country names for better readability
plt.grid(axis='y', linestyle='--') # Add horizontal grid lines
plt.tight_layout() # Adjust layout to prevent labels from overlapping
plt.show()

**Explanation:**
This example creates a bar chart showing the 'Total' immigration from a predefined list of five countries.
* `totals_subset = df_can.loc[countries_for_bar, 'Total']` creates a Pandas Series with countries as the index and their total immigration as values.
* `totals_subset.plot(kind='bar', ...)` generates the bar chart.
* `color=[...]` provides a list of colors for the bars.
* `plt.xticks(rotation=45, ha='right')` rotates the x-axis labels (country names) to prevent overlap.
* `plt.grid(axis='y', linestyle='--')` adds horizontal dashed grid lines for easier comparison of bar heights.
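
Bars are drawn in the order of the Series index, so sorting the Series first controls the bar order. The following is a small sketch on hypothetical totals (the `totals` values are made up), assuming the `Agg` backend so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical totals for illustration only.
totals = pd.Series({"Country A": 300, "Country B": 900, "Country C": 600})

# Sort descending so the tallest bar is drawn first (leftmost).
ordered = totals.sort_values(ascending=False)

ax = ordered.plot(kind="bar", color=["skyblue", "salmon", "lightgreen"])
ax.grid(axis="y", linestyle="--")

# Each bar is a Rectangle patch; its height is the plotted value.
bar_heights = [patch.get_height() for patch in ax.patches]
print(list(ordered.index), bar_heights)
plt.close(ax.figure)
```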

---

## 50 Questions based on Covered Topics (Pandas & Matplotlib)

Here are 50 questions with hints, covering the topics from the enhanced notebooks:

**Pandas: Basics & DataFrame Inspection**

1.  **Q:** Which Pandas function is used to load data from an Excel file?
    * **Hint:** It's `pd.read_...`
2.  **Q:** How do you display the first 7 rows of a DataFrame `df`?
    * **Hint:** Use a method that shows the "head" of the DataFrame.
3.  **Q:** What method provides a concise summary of a DataFrame, including data types and non-null values?
    * **Hint:** It gives "info" about the DataFrame.
4.  **Q:** If `df.shape` returns `(100, 15)`, what does 100 represent?
    * **Hint:** Rows or columns?
5.  **Q:** How can you get a list of all column names from a DataFrame `df`?
    * **Hint:** Access the `.columns` attribute and convert it.
6.  **Q:** What is the default data type of `df.index` before any conversion?
    * **Hint:** Is it a Python list or a Pandas-specific object?
7.  **Q:** How do you get the number of non-null values for each column in `df`?
    * **Hint:** Use `df.count()`, or combine `notnull()` with `sum()`. (Alternatively, `info()` shows this.)
8.  **Q:** Which method is used to get descriptive statistics (mean, std, min, max, etc.) for numeric columns in `df`?
    * **Hint:** It "describes" the data.
9.  **Q:** To see descriptive statistics for non-numeric (object) columns, what argument do you pass to `describe()`?
    * **Hint:** `include='...'`
10. **Q:** What does `df.size` return?
    * **Hint:** Total number of ... in the DataFrame.
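
For trying out the inspection calls from this group, a small frame with a few missing values is enough. The `df` below is hypothetical and made up for illustration; run it and compare the printed results with your answers.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in 'Continent' and one in 'Total'.
df = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Continent": ["Asia", "Europe", None],
    "Total": [10.0, np.nan, 30.0],
})

print(df.shape)           # (rows, columns)
print(df.size)            # rows * columns
print(df.count())         # non-null values per column
print(df.isnull().sum())  # null values per column
print(list(df.columns))   # column names as a plain Python list
```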

**Pandas: Data Cleaning & Manipulation**

11. **Q:** How do you remove a column named 'OldColumn' from `df` permanently?
    * **Hint:** Use `df.drop()` with `axis=1` and an argument for permanency.
12. **Q:** What is the syntax to rename a column 'OldName' to 'NewName' in `df`?
    * **Hint:** `df.rename(columns={'OldName': ...})`
13. **Q:** How can you create a new column 'DecadeTotal' in `df` by summing columns 'Year1', 'Year2', and 'Year3'?
    * **Hint:** `df['DecadeTotal'] = df[['Year1', ...]].sum(axis=...)`
14. **Q:** After setting 'Country' as the index for `df_can`, what command would remove this index and revert to a default integer index?
    * **Hint:** The opposite of `set_index()`.
15. **Q:** If you want to convert all column names in `df` to string type, and they are currently a mix of integers and strings, what Python function can be used with `map`?
    * **Hint:** `str`

**Pandas: Indexing, Selection & Filtering**

16. **Q:** To select the column 'Population' from `df`, which of these is generally more robust: `df.Population` or `df['Population']`?
    * **Hint:** One fails if the column name has spaces.
17. **Q:** How do you select rows for 'India' and 'China' and columns for years '1990' through '1995' (inclusive) from `df_can` (where 'Country' is index and year columns are strings)?
    * **Hint:** Use `.loc` with lists and slices.
18. **Q:** If `df` has a default integer index, how do you select the first 3 rows?
    * **Hint:** Use `.iloc` with a slice.
19. **Q:** How do you select all rows where the 'Continent' column is 'Asia' in `df_can`?
    * **Hint:** `df_can[df_can['Continent'] == ...]`
20. **Q:** To filter `df_can` for countries where 'Region' is 'Western Europe' AND immigration in '2013' (numeric column) was greater than 5000, what is the correct way to combine conditions?
    * **Hint:** Use `&` and parentheses for each condition.
21. **Q:** What does `df_can.loc['Germany']` return if 'Germany' is a valid index label?
    * **Hint:** A Series or DataFrame?
22. **Q:** How do you select the data for 'Japan' in the year '2013' using `.loc`?
    * **Hint:** `df_can.loc[row_label, column_label]`
23. **Q:** If you want to select all columns from '1990' to '2000' (inclusive, string names) for all countries, how would you use `.loc`?
    * **Hint:** `df_can.loc[:, '1990':'2000']`
24. **Q:** What is the purpose of `df_can.columns.get_loc('ColumnName')`?
    * **Hint:** It helps find the integer ... of a column.
25. **Q:** How do you filter `df_can` to show only countries whose 'Continent' is one of ['Africa', 'Oceania']?
    * **Hint:** Use the `.isin()` method within a boolean condition.
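
The selection and filtering patterns in this group can be tried on a toy frame before applying them to `df_can`. This is a minimal sketch; the `toy` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical data with string-named year columns, as in df_can.
toy = pd.DataFrame(
    {
        "Continent": ["Asia", "Europe", "Africa", "Europe"],
        "Region": ["Southern Asia", "Western Europe", "Northern Africa", "Western Europe"],
        "2012": [100, 6000, 50, 400],
        "2013": [120, 7000, 60, 300],
    },
    index=["A", "B", "C", "D"],
)

# Combine two conditions with & and wrap each one in parentheses.
both = toy[(toy["Region"] == "Western Europe") & (toy["2013"] > 5000)]

# Match any value from a list with .isin().
subset = toy[toy["Continent"].isin(["Africa", "Oceania"])]

# Label-based selection: a list of rows and an inclusive column slice.
block = toy.loc[["A", "B"], "2012":"2013"]

# Integer position of a column label.
position = toy.columns.get_loc("2013")
print(list(both.index), list(subset.index), block.shape, position)
```

Note that label slices with `.loc` include both endpoints, unlike integer slices with `.iloc`.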

**Pandas: Sorting**

26. **Q:** How do you sort `df_can` by the 'Total' column in descending order?
    * **Hint:** `df_can.sort_values(by=..., ascending=...)`
27. **Q:** To sort `df_can` first by 'Continent' (ascending) and then by 'Region' (ascending), how do you specify this in `sort_values()`?
    * **Hint:** Pass a list to the `by` parameter.
28. **Q:** What does the `inplace=True` argument do in methods like `sort_values()` or `rename()`?
    * **Hint:** Modifies the DataFrame directly or returns a new one?

**Matplotlib: Basics & Line Plots**

29. **Q:** What is the common alias for `matplotlib.pyplot`?
    * **Hint:** `import matplotlib.pyplot as ...`
30. **Q:** What command ensures Matplotlib plots are displayed directly in the Jupyter Notebook output cell?
    * **Hint:** A "magic" command starting with `%`.
31. **Q:** If `haiti` is a Pandas Series (with years as index and immigrant numbers as values), what is the simplest command to generate a line plot?
    * **Hint:** `series_name.plot()`
32. **Q:** Which `pyplot` function is used to set the title of a plot?
    * **Hint:** `plt.title(...)`
33. **Q:** How do you label the x-axis as 'Years' and the y-axis as 'Number of People'?
    * **Hint:** `plt.xlabel(...)` and `plt.ylabel(...)`
34. **Q:** What function displays the plot after all customizations?
    * **Hint:** `plt.show()`
35. **Q:** If you are plotting a Pandas Series `s` and its index contains string representations of years (e.g., '1980', '1981'), what should you ideally do to the index before plotting for a numerically correct x-axis?
    * **Hint:** Convert the index to `int` type using `.map(int)`.
36. **Q:** How do you add text annotation 'Event X' at coordinates (year 2000, value 5000) on a plot?
    * **Hint:** `plt.text(x, y, 'string')`
37. **Q:** If `df_CI` has countries as index and years as columns, why is `df_CI.transpose()` needed before plotting to get lines for each country over the years?
    * **Hint:** Pandas plots columns as separate lines by default. Transposing makes countries into columns.
38. **Q:** When plotting multiple lines from a DataFrame (e.g., `df_top5_plot.plot(kind='line')`), where do the automatically generated legend entries typically come from?
    * **Hint:** Based on DataFrame column names.
39. **Q:** What parameter in the `.plot()` method controls the size of the figure (e.g., to make it 10 inches wide and 6 inches tall)?
    * **Hint:** `figsize=(width, height)`
40. **Q:** How can you change the line color to 'red' and linestyle to dashed (`--`) when plotting a Series `s`?
    * **Hint:** `s.plot(kind='line', color=..., linestyle=...)`
41. **Q:** What `pyplot` function can add a vertical line at x=2010 on your plot?
    * **Hint:** `plt.axvline(...)`
42. **Q:** How do you add a grid to your Matplotlib plot?
    * **Hint:** `plt.grid(...)`

**Matplotlib: Other Plot Types & Customization**

43. **Q:** To create a vertical bar plot of a Series `s`, what `kind` argument do you pass to `s.plot()`?
    * **Hint:** `kind='...'`
44. **Q:** For a bar chart with country names on the x-axis, what `pyplot` function can help improve readability if names overlap?
    * **Hint:** `plt.xticks(rotation=...)`
45. **Q:** What is the purpose of `plt.legend()`?
    * **Hint:** Displays labels for different lines or elements on the plot.
46. **Q:** If you get an Axes object `ax = df.plot()`, how might you change the background color of the plot area itself?
    * **Hint:** `ax.set_facecolor(...)`
47. **Q:** What does `plt.tight_layout()` attempt to do?
    * **Hint:** Adjusts plot parameters for a "tight" fit of elements.

**Conceptual Questions**

48. **Q:** Why is it generally good practice to convert year column names or index values to a numeric type (like `int`) before creating line plots where years are on an axis?
    * **Hint:** Ensures correct spacing and interpretation of the axis as continuous.
49. **Q:** When would a line plot be more appropriate than a bar chart for visualizing immigration data over time for a single country?
    * **Hint:** Line plots are good for showing trends over continuous intervals.
50. **Q:** If you have a DataFrame where rows represent different products and columns represent monthly sales, and you want to plot each product's sales trend over time, what is the key step before calling `.plot()`?
    * **Hint:** Similar to the countries/years problem; you likely need to make products the columns.

---

Remember to execute the Python code cells in the Jupyter environment to see the interactive prompts and outputs.