# D-Lab GitHub Copilot Workshop - Python Solutions

In [None]:
# Load necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Read in data
gap = pd.read_csv("data/gapminder.csv")
gap.head()

### 🥊 Challenge 1:
Open the chat, highlight the code in the cell below, and ask Copilot to explain it by typing a question into the chat (e.g., "Explain this code, please"). There is no need to paste the code in; highlighting it is enough.

In [None]:
gap.groupby(['continent', 'year'])['lifeExp'].mean().reset_index().pivot(index='continent', columns='year', values='lifeExp')

This is the chat response we got. Note that your chat response may be different:

"This code snippet performs a series of data manipulation tasks on the gap dataset using pandas. Here is a step-by-step explanation:

1. `gap.groupby(['continent', 'year'])['lifeExp']`: This groups the data by continent and year, then selects the lifeExp column for aggregation.

2. `.mean()`: This calculates the mean life expectancy for each group (each combination of continent and year).

3. `.reset_index()`: This converts the grouped data back to a regular DataFrame with continent, year, and lifeExp columns.

4. `.pivot(index='continent', columns='year', values='lifeExp')`: This reshapes the data from long format to wide format. Specifically:
   - `index='continent'`: Each continent becomes a row
   - `columns='year'`: Each unique year becomes a column
   - `values='lifeExp'`: The mean life expectancy values fill the cells

**Example Output**
The resulting DataFrame will have:
- One row for each continent
- One column for each year containing the mean life expectancy for that year and continent

This transformation makes it easier to compare mean life expectancy across different years for each continent."
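
To check the four steps by hand, here is a minimal sketch on a made-up toy DataFrame (the `toy` values are invented for illustration and are not Gapminder data). It also shows `pivot_table`, which should produce the same table in a single call.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration (not Gapminder values)
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Asia", "Europe", "Europe", "Europe"],
    "year": [2002, 2002, 2007, 2002, 2007, 2007],
    "lifeExp": [60.0, 70.0, 72.0, 76.0, 78.0, 80.0],
})

# Steps 1-3: group, average, and return to a regular long-format DataFrame
long_means = toy.groupby(["continent", "year"])["lifeExp"].mean().reset_index()

# Step 4: reshape from long to wide (one row per continent, one column per year)
wide = long_means.pivot(index="continent", columns="year", values="lifeExp")

# Alternative: pivot_table groups, averages, and reshapes in one call
wide_alt = toy.pivot_table(
    index="continent", columns="year", values="lifeExp", aggfunc="mean"
)

print(wide)
```

Here the two Asia rows for 2002 (60.0 and 70.0) collapse into a single cell holding their mean, which is the same thing the chained version does on the full dataset.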

Copilot can also help us debug errors in our code. Below I have written some code to create a histogram of the "gdpPercap" column in the `gap` DataFrame. However, I have made three mistakes in the code. Can you find them? Highlight the code below and ask Copilot to help you debug it.

In [None]:
plt.figure(figsize=(10, 6))
plt.hist(gap['gdpPercap'], bins=30, alpha=0.7)
plt.axhline(y=gap['gdpPercap'].mean(), color='red', linestyle='--')  # Should be axvline for vertical line
plt.xlabel('GDP per Capita')  # The y-axis label (plt.ylabel) is missing
plt.title('Histogram of GDP per Capita')
plt.colorbar()  # This doesn't make sense for a histogram
plt.show()

💡 **Tip**: Copilot may suggest various improvements when debugging. This kind of "conversation" is an important part of working with Copilot.

If you hover over a code chunk generated by Copilot, buttons appear in its upper-right corner, including (1) a `Copy` button that copies the code to your clipboard and (2) an `Insert at cursor` button that inserts the code at the cursor position in your open file.

Copilot corrected code:

In [None]:
plt.figure(figsize=(10, 6))
plt.hist(gap['gdpPercap'], bins=30, alpha=0.7, color='skyblue', edgecolor='black')
plt.axvline(x=gap['gdpPercap'].mean(), color='red', linestyle='--', label='Mean')  # Corrected to axvline
plt.xlabel('GDP per Capita')
plt.ylabel('Frequency')  # Added y-label
plt.title('Histogram of GDP per Capita')
plt.legend()  # Added legend instead of colorbar
plt.show()

### 🥊 Challenge 2:
Highlight this matplotlib code, press `Ctrl + I`, and ask Copilot to change the background of the plot to your favorite color (e.g., "purple"), increase the text size of the x- and y-axis titles to 14, and center-align the title. This is super helpful when you can't remember the exact syntax for a specific matplotlib element.

In [None]:
plt.figure(figsize=(10, 6))
plt.hist(gap['gdpPercap'], bins=30, color='blue', edgecolor='black')
plt.xlabel('GDP per Capita', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.title('Histogram of GDP per Capita', loc='center', fontsize=16)  # loc controls title alignment
plt.gca().set_facecolor('lavender')  # Changed background color
plt.show()
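
Note that `set_facecolor` on the axes colors only the plotting area; the margin around it belongs to the figure. Below is a minimal sketch of the difference, using matplotlib's object-oriented interface and randomly generated stand-in values (not the `gap` data). The color choices are arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in values generated for illustration (not the gap data)
rng = np.random.default_rng(0)
values = rng.lognormal(mean=8, sigma=1, size=500)

fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(values, bins=30, color="blue", edgecolor="black")

ax.set_facecolor("lavender")        # background of the plotting area only
fig.patch.set_facecolor("thistle")  # background of the surrounding figure

ax.set_xlabel("GDP per Capita", fontsize=14)
ax.set_ylabel("Frequency", fontsize=14)
ax.set_title("Histogram of GDP per Capita", loc="center", fontsize=16)
# In a notebook the figure renders at the end of the cell;
# call plt.show() when running this as a script.
```

If Copilot changes only one of the two backgrounds, a follow-up prompt naming "figure background" or "axes background" explicitly usually helps.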

### In-line suggestions
You may have noticed throughout this workshop that Copilot provides suggestions in lighter gray text as you type. These suggestions can be helpful for completing code quickly. You can accept a suggestion by pressing `Tab`.

Suggestions are automatically triggered by the code you write, based on context from the code you have already written.

### 🥊 Challenge 3:
Start typing `unique_countries = gap['country'].` and see what suggestions Copilot gives you. Accept the suggestion that completes the code to create an array of the unique countries in the `gap` DataFrame.

In [None]:
unique_countries = gap['country'].unique()
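
`unique()` returns an array rather than a Python list. If you need a list or just the number of distinct values, something like the following sketch should work; the `countries` Series is a hypothetical stand-in for `gap['country']`.

```python
import pandas as pd

# Hypothetical toy Series standing in for gap['country']
countries = pd.Series(["Chile", "Japan", "Chile", "Kenya", "Japan"])

unique_array = countries.unique()       # array of distinct values, in order of first appearance
unique_list = list(countries.unique())  # plain Python list
n_unique = countries.nunique()          # number of distinct values

print(unique_list)
print(n_unique)
```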

### 🥊 Challenge 4:
You can also trigger suggestions by providing comments in your code. These comments allow you to give more detailed information and context for the in-line suggestions. Start typing `# Calculate average gdpPercap and lifeExp grouped by continent for the year 2007` and see what suggestions Copilot gives you.

💡 **Tip**: You may have to start writing `gap_2007 = gap` to trigger the suggestions.

In [None]:
# Calculate average gdpPercap and lifeExp grouped by continent for the year 2007
gap_2007 = gap[gap['year'] == 2007].groupby('continent')[['gdpPercap', 'lifeExp']].mean().reset_index()
gap_2007
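
The one-liner above chains a filter, a grouping, and an aggregation. The sketch below splits the same idea into steps on an invented toy DataFrame and uses pandas named aggregation to give the result columns explicit names. The names `mean_gdpPercap` and `mean_lifeExp` are chosen here for illustration only.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration (not Gapminder values)
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Europe", "Asia"],
    "year": [2007, 2007, 2007, 2007, 2002],
    "gdpPercap": [1000.0, 3000.0, 20000.0, 30000.0, 500.0],
    "lifeExp": [60.0, 70.0, 78.0, 80.0, 55.0],
})

# Step 1: keep only the rows for 2007
toy_2007 = toy[toy["year"] == 2007]

# Step 2: group by continent and average both columns, naming the outputs
summary = (
    toy_2007.groupby("continent")
    .agg(
        mean_gdpPercap=("gdpPercap", "mean"),
        mean_lifeExp=("lifeExp", "mean"),
    )
    .reset_index()
)

print(summary)
```

Writing the steps out this way makes it easier to check a Copilot suggestion piece by piece, for example by confirming that the 2002 row was dropped before the averages were taken.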

### 🥊 Challenge 5:

Copilot can be really useful for cleaning up your code. For example, we have code below that is a bit untidy (missing proper spacing, no comments, long single lines of code). Highlight the code below and in the chat ask Copilot to format this code and add comments.

In [None]:
gap_asia=gap[gap['continent']=='Asia'].groupby('year')['lifeExp'].mean().reset_index()
plt.figure(figsize=(8,6))
plt.plot(gap_asia['year'],gap_asia['lifeExp'])
plt.xlabel('Year')
plt.ylabel('Life Expectancy')
plt.title('Average Life Expectancy in Asia by Year')
plt.show()

gap_europe=gap[gap['continent']=='Europe'].groupby('year')['lifeExp'].mean().reset_index()
plt.figure(figsize=(8,6))
plt.plot(gap_europe['year'],gap_europe['lifeExp'])
plt.xlabel('Year')
plt.ylabel('Life Expectancy')
plt.title('Average Life Expectancy in Europe by Year')
plt.show()

gap_oceania=gap[gap['continent']=='Oceania'].groupby('year')['lifeExp'].mean().reset_index()
plt.figure(figsize=(8,6))
plt.plot(gap_oceania['year'],gap_oceania['lifeExp'])
plt.xlabel('Year')
plt.ylabel('Life Expectancy')
plt.show()

You may notice that you have to do some extra work to complete the clean-up; for example, you may want to remove unnecessary or incorrect comments or add titles that are missing.

In [None]:
# Calculate the mean life expectancy for Asia by year
gap_asia = gap[gap['continent'] == 'Asia'].groupby('year')['lifeExp'].mean().reset_index()

# Plot the mean life expectancy for Asia by year
plt.figure(figsize=(8, 6))
plt.plot(gap_asia['year'], gap_asia['lifeExp'])
plt.xlabel('Year')
plt.ylabel('Life Expectancy')
plt.title('Average Life Expectancy in Asia by Year')
plt.show()

# Calculate the mean life expectancy for Europe by year
gap_europe = gap[gap['continent'] == 'Europe'].groupby('year')['lifeExp'].mean().reset_index()

# Plot the mean life expectancy for Europe by year
plt.figure(figsize=(8, 6))
plt.plot(gap_europe['year'], gap_europe['lifeExp'])
plt.xlabel('Year')
plt.ylabel('Life Expectancy')
plt.title('Average Life Expectancy in Europe by Year')
plt.show()

# Calculate the mean life expectancy for Oceania by year
gap_oceania = gap[gap['continent'] == 'Oceania'].groupby('year')['lifeExp'].mean().reset_index()

# Plot the mean life expectancy for Oceania by year
plt.figure(figsize=(8, 6))
plt.plot(gap_oceania['year'], gap_oceania['lifeExp'])
plt.xlabel('Year')
plt.ylabel('Life Expectancy')
plt.title('Average Life Expectancy in Oceania by Year')
plt.show()

We can take this clean-up a step further by asking Copilot to replace the repetitive code with a function. Highlight the code above and ask Copilot to turn it into a function.

In [None]:
def plot_life_expectancy(continent_name):
    """
    Function to calculate and plot mean life expectancy by year for a given continent
    
    Parameters:
    continent_name (str): Name of the continent to plot
    """
    # Calculate the mean life expectancy for the given continent by year
    gap_continent = gap[gap['continent'] == continent_name].groupby('year')['lifeExp'].mean().reset_index()
    
    # Plot the mean life expectancy for the given continent by year
    plt.figure(figsize=(8, 6))
    plt.plot(gap_continent['year'], gap_continent['lifeExp'], marker='o')
    plt.xlabel('Year')
    plt.ylabel('Life Expectancy')
    plt.title(f'Average Life Expectancy in {continent_name} by Year')
    plt.grid(True, alpha=0.3)
    plt.show()

# Call the function for Asia
plot_life_expectancy("Asia")

# Call the function for Europe
plot_life_expectancy("Europe")

# Call the function for Oceania
plot_life_expectancy("Oceania")
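
The function above reads the global `gap` variable. One possible refinement, sketched here on the assumption that you want to reuse it with other DataFrames, is to pass the data in as an argument and separate the calculation from the plotting. The helper name `mean_life_expectancy` and the `toy` data are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt


def mean_life_expectancy(data, continent_name):
    """Return mean life expectancy by year for one continent."""
    subset = data[data["continent"] == continent_name]
    if subset.empty:
        # Fail loudly on a misspelled continent instead of plotting nothing
        raise ValueError(f"No rows found for continent: {continent_name!r}")
    return subset.groupby("year")["lifeExp"].mean().reset_index()


def plot_life_expectancy(data, continent_name):
    """Plot mean life expectancy by year and return the Axes."""
    yearly = mean_life_expectancy(data, continent_name)
    fig, ax = plt.subplots(figsize=(8, 6))
    ax.plot(yearly["year"], yearly["lifeExp"], marker="o")
    ax.set_xlabel("Year")
    ax.set_ylabel("Life Expectancy")
    ax.set_title(f"Average Life Expectancy in {continent_name} by Year")
    ax.grid(True, alpha=0.3)
    return ax


# Hypothetical toy data, invented for illustration (not Gapminder values)
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Asia", "Europe"],
    "year": [2002, 2002, 2007, 2007],
    "lifeExp": [60.0, 70.0, 72.0, 79.0],
})

ax = plot_life_expectancy(toy, "Asia")
```

Returning the aggregated data from its own function means the numbers can be inspected or tested without drawing a plot.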

### 🥊 Challenge 6:

Example response from GitHub Copilot when prompted to "determine the drivers of life expectancy":

"To determine the drivers of life expectancy, you can perform a regression analysis using relevant predictors such as GDP per capita, population, year, etc. Here's a step-by-step plan and the corresponding Python code to perform this analysis:

**Step-by-Step Plan:**
1. Load necessary libraries: Use pandas for data manipulation, sklearn for machine learning, and statsmodels for detailed statistical analysis.
2. Load the dataset: Ensure the dataset contains life expectancy and potential predictors.
3. Prepare the data: Handle categorical variables and check for missing values.
4. Fit a linear regression model: Use sklearn or statsmodels to fit the model with life expectancy as the dependent variable.
5. Summarize the model: Get model statistics and coefficients.
6. Interpret the results: Identify significant predictors based on p-values and coefficients.

**Python Code:**

In [None]:
# Load necessary libraries
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import r2_score
import statsmodels.api as sm
import pandas as pd

# Prepare the data
# Encode categorical variables
le_country = LabelEncoder()
le_continent = LabelEncoder()

gap_encoded = gap.copy()
gap_encoded['country_encoded'] = le_country.fit_transform(gap['country'])
gap_encoded['continent_encoded'] = le_continent.fit_transform(gap['continent'])

# Select features and target
features = ['year', 'pop', 'gdpPercap', 'continent_encoded', 'country_encoded']
X = gap_encoded[features]
y = gap_encoded['lifeExp']

# Fit the model using statsmodels for detailed statistics
X_with_const = sm.add_constant(X)
model = sm.OLS(y, X_with_const).fit()

# Print the model summary
print(model.summary())

# Alternative: Use sklearn for a simpler approach
sklearn_model = LinearRegression()
sklearn_model.fit(X, y)

print("\n=== Sklearn Model Results ===")
print(f"R-squared: {r2_score(y, sklearn_model.predict(X)):.4f}")
print("\nCoefficients:")
for feature, coef in zip(features, sklearn_model.coef_):
    print(f"{feature}: {coef:.4f}")
print(f"Intercept: {sklearn_model.intercept_:.4f}")

**Explanation:**
- **Libraries**: pandas for data manipulation, sklearn for machine learning, and statsmodels for detailed statistical analysis.
- **Data Preparation**: Categorical variables (country, continent) are encoded numerically using LabelEncoder.
- **Model**: Uses both statsmodels OLS for detailed statistics and sklearn LinearRegression for simpler output.
- **Features**: year, population, GDP per capita, and encoded categorical variables as predictors.
- **Output**: The model summary shows coefficients, p-values, R-squared, and other statistical measures.

This analysis will help you identify which predictors significantly influence life expectancy. However, as demonstrated in the main workshop, it's crucial to validate these results through proper model diagnostics and visualization."
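
One point worth checking in a response like this: `LabelEncoder` maps each category to an integer, so a linear model treats continents and countries as if they sat on an ordered numeric scale. A common alternative for unordered categories is one-hot (dummy) encoding. The sketch below contrasts the two on an invented toy DataFrame using `pd.get_dummies`. It illustrates the encoding step only and is not the workshop's model.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration (not Gapminder values)
toy = pd.DataFrame({
    "continent": ["Asia", "Europe", "Africa", "Asia"],
    "gdpPercap": [1000.0, 20000.0, 800.0, 3000.0],
})

# Integer codes, as label encoding would produce: Africa=0, Asia=1, Europe=2.
# A regression would read these as "Europe is twice Asia", which is meaningless.
codes = toy["continent"].astype("category").cat.codes

# One-hot encoding: one 0/1 column per continent, with the first
# category (Africa) dropped to serve as the baseline
X = pd.get_dummies(toy, columns=["continent"], drop_first=True, dtype=int)

print(codes.tolist())
print(X.columns.tolist())
```

With dummy columns, each coefficient can be read as a difference from the baseline continent rather than as a slope along an arbitrary numbering.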