<a href="https://colab.research.google.com/github/comparativechrono/Principles-of-Data-Science/blob/main/Week_2/Section_6_Python_Example__Panel_Data_Analysis_with_Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Section 6 Python example - panel data analysis with pandas

Panel data follows the same entities (for example stores, patients, or countries) over several time periods. Analysing it reveals dynamics that are not observable in purely cross-sectional or purely time series datasets, which makes it important in fields such as economics, epidemiology, and the social sciences. In this example, we will demonstrate how to handle and analyse panel data in Python using Pandas, a library for data manipulation and analysis.

1. Setting Up the Environment:

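Ensure that Python, Pandas, and Matplotlib are installed in your environment. If either library is missing, you can install it with pip (in a notebook, the `%pip` magic installs into the environment the kernel is running in):

In [None]:
%pip install pandas matplotlib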

2. Importing Required Libraries:

Start by importing Pandas. We'll also import Matplotlib for the plot in step 4:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

3. Creating a Synthetic Panel Dataset:

For the purpose of this example, let's create a synthetic dataset representing sales data over three years for different stores:

In [None]:
# Generate synthetic data
data = {
    'Year': ['2019', '2019', '2019', '2020', '2020', '2020', '2021', '2021', '2021'],
    'Store': ['Store_1', 'Store_2', 'Store_3', 'Store_1', 'Store_2', 'Store_3', 'Store_1', 'Store_2', 'Store_3'],
    'Sales': [200, 150, 300, 250, 180, 320, 300, 190, 350]
}

# Create a DataFrame
df = pd.DataFrame(data)

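# Convert 'Year' from a string to a datetime; with format='%Y' each year
# becomes a timestamp for 1 January of that year
df['Year'] = pd.to_datetime(df['Year'], format='%Y')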

# Display the DataFrame
print(df)
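
The DataFrame above is in "long" format: one row per store-year observation. The following is a minimal sketch, assuming the same synthetic values, of two common ways to organise such data: indexing by entity and time, and reshaping to "wide" format. The names `df_long`, `panel`, and `wide` are illustrative and not part of the original example, and `Year` is kept as a plain integer here for simplicity.

```python
import pandas as pd

# Same synthetic sales figures as above, with Year stored as an integer
df_long = pd.DataFrame({
    'Year': [2019, 2019, 2019, 2020, 2020, 2020, 2021, 2021, 2021],
    'Store': ['Store_1', 'Store_2', 'Store_3'] * 3,
    'Sales': [200, 150, 300, 250, 180, 320, 300, 190, 350],
})

# Index by (entity, time): each row is one store observed in one year
panel = df_long.set_index(['Store', 'Year']).sort_index()

# Look up a single observation by store and year
print(panel.loc[('Store_2', 2020), 'Sales'])

# Reshape to wide format: one row per year, one column per store
wide = df_long.pivot(index='Year', columns='Store', values='Sales')
print(wide)
```

The long format is usually easier for grouped calculations, while the wide format is convenient for side-by-side comparison of entities.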

4. Visualizing the Data:

Plotting one line per store shows how each entity's sales evolve over time and makes the stores easy to compare.

In [None]:
# Plot sales over time for each store
for label, grp in df.groupby('Store'):
    plt.plot(grp['Year'], grp['Sales'], label=label)

plt.title('Sales Over Time by Store')
plt.xlabel('Year')
plt.ylabel('Sales')
plt.legend()
plt.show()

5. Performing Panel Data Analysis:

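A simple panel calculation is the year-over-year growth rate within each store. Growth must be computed per entity, so we sort by store and year and then group by store; this ensures that one store's first year is never compared with another store's last year: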

In [None]:
# Calculate year-over-year growth for each store
# Replace the timestamps with plain integer years for a more readable table
df['Year'] = df['Year'].dt.year
# Sort so that each store's rows are in chronological order
df = df.sort_values(by=['Store', 'Year'])

# Group by 'Store' and calculate the percentage change from the previous year
df['Sales Growth'] = df.groupby('Store')['Sales'].pct_change() * 100

# Each store's first year has no previous year, so its growth is NaN.
# Filling with zero is a display convenience only; it does not mean sales were flat.
df['Sales Growth'] = df['Sales Growth'].fillna(0)

# Display the modified DataFrame
print(df)
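
Filling the first-year gap with zero records a growth of 0% where growth is really undefined, which pulls per-store averages down. The following is a minimal sketch, assuming the same synthetic values, that compares both choices. The names `sales`, `growth_nan`, `growth_zero`, and `summary` are illustrative and not part of the original example.

```python
import pandas as pd

# Same synthetic sales figures as above, sorted chronologically within each store
sales = pd.DataFrame({
    'Year': [2019, 2019, 2019, 2020, 2020, 2020, 2021, 2021, 2021],
    'Store': ['Store_1', 'Store_2', 'Store_3'] * 3,
    'Sales': [200, 150, 300, 250, 180, 320, 300, 190, 350],
}).sort_values(['Store', 'Year'])

# Year-over-year growth in percent; the first year of each store stays NaN
sales['growth_nan'] = sales.groupby('Store')['Sales'].pct_change() * 100

# Same series with the undefined first-year value replaced by zero
sales['growth_zero'] = sales['growth_nan'].fillna(0)

# mean() skips NaN, so growth_nan averages the two observed changes only,
# while growth_zero averages three values, one of them an artificial 0
summary = sales.groupby('Store')[['growth_nan', 'growth_zero']].mean()
print(summary.round(2))
```

For Store_1, the observed changes of 25% and 20% average to 22.5%, whereas the zero-filled column averages to 15%. Keeping the NaN is therefore the safer choice whenever the growth column feeds into further calculations.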

6. Insights and Further Analysis:

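The growth rates show how sales are evolving for each store. In this synthetic dataset, Store_1 grows fastest (25% in 2020 and 20% in 2021), Store_2's growth slows from 20% to about 5.6%, and Store_3's growth picks up from about 6.7% to about 9.4%. The 0% values for 2019 are placeholders for the missing previous year, not observed growth. For deeper analysis, one could apply statistical tests to determine whether the changes are statistically significant, or use time series techniques to forecast future sales. You may want to revisit this dataset towards the end of this course.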
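
As one possible first step toward that further analysis, the sketch below fits a straight-line trend to each store's sales with `numpy.polyfit`. It is an illustration only: three observations per store are far too few for reliable inference or forecasting, and the helper `linear_trend` and the names `sales` and `trend` are hypothetical, not part of the original example.

```python
import numpy as np
import pandas as pd

# Same synthetic sales figures as above
sales = pd.DataFrame({
    'Year': [2019, 2019, 2019, 2020, 2020, 2020, 2021, 2021, 2021],
    'Store': ['Store_1', 'Store_2', 'Store_3'] * 3,
    'Sales': [200, 150, 300, 250, 180, 320, 300, 190, 350],
})

def linear_trend(group):
    """Return the least-squares slope of Sales on Year (sales units per year)."""
    # Measure years from the first observation to keep the fit well conditioned
    years = (group['Year'] - group['Year'].min()).to_numpy(dtype=float)
    slope, _intercept = np.polyfit(years, group['Sales'].to_numpy(dtype=float), deg=1)
    return slope

# One slope per store, computed group by group
trend = pd.Series({store: linear_trend(g) for store, g in sales.groupby('Store')})
print(trend.round(2))
```

The slope summarises each store's average yearly change in sales in a single number, which complements the period-by-period growth rates.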

Conclusion:

Pandas handles panel data well: sorting, grouping by entity, and computing within-group changes each take a single line. By transforming, aggregating, and visualizing panel data in this way, researchers and analysts can see how each entity changes over time and how entities differ from one another, whether the goal is academic research or business analysis.