# North American Universities Analysis

This notebook analyzes the top 200 universities in the United States and Canada, focusing on:
- Educational quality rankings
- Financial resources (endowment, tuition)
- Academic resources (staff, library)
- Historical context (establishment dates)
- Student demographics

In [None]:
# Import required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats

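# Set style for better visualizations
# (the 'seaborn' style was renamed 'seaborn-v0_8' in Matplotlib 3.6; the old name no longer works in recent releases)
plt.style.use('seaborn-v0_8')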
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = [15, 10]
plt.rcParams['font.size'] = 10

## 1. Data Loading and Cleaning

Let's load our dataset and perform necessary cleaning operations:

In [None]:
# Read and clean the data
df = pd.read_csv('NorthAmericaUniversities.csv', encoding='latin1')

# Clean the data
df['Name'] = df['Name'].str.strip()
df['Country'] = df['Country'].str.lower().str.strip()
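# regex=False makes '$' a literal character instead of the end-of-string regex anchor
df['Minimum Tuition cost'] = (
    df['Minimum Tuition cost']
    .str.replace('"', '', regex=False)
    .str.replace('$', '', regex=False)
    .str.replace(',', '', regex=False)
    .astype(float)
)
# Endowment is stored in billions of dollars after the 'B' suffix is removed
df['Endowment'] = (
    df['Endowment']
    .str.replace('$', '', regex=False)
    .str.replace('B', '', regex=False)
    .astype(float)
)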
df['Age'] = 2024 - df['Established']
df['Student_Staff_Ratio'] = df['Number of Students'] / df['Academic Staff']

# Display basic information about the dataset
print("Dataset Info:")
df.info()  # info() prints directly and returns None, so wrapping it in print() adds a stray "None"

print("\nFirst few rows:")
df.head()
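
The cleaning step above strips only a `B` suffix from `Endowment`, so any value written with a different suffix would fail at `astype(float)`. As a minimal sketch, assuming the raw column may also contain values in millions marked with `M` (an assumption, not something the notebook confirms), a small parser could normalize everything to billions. The function name `parse_endowment_billions` and the sample strings are hypothetical.

```python
import pandas as pd


def parse_endowment_billions(value):
    """Convert strings such as '$50.7B' or '$713.5M' to a float in billions.

    The 'M' suffix is an assumed format; unknown formats raise ValueError
    so that unexpected values are noticed rather than silently mis-parsed.
    """
    if pd.isna(value):
        return float("nan")
    text = str(value).strip().replace("$", "").replace(",", "")
    multipliers = {"B": 1.0, "M": 1e-3}  # billions stay as-is, millions are scaled down
    suffix = text[-1:].upper()
    if suffix not in multipliers:
        raise ValueError(f"Unrecognized endowment format: {value!r}")
    return float(text[:-1]) * multipliers[suffix]


# Made-up sample values for illustration only
sample = pd.Series(["$50.7B", "$713.5M", None])
parsed = sample.map(parse_endowment_billions)
print(parsed)
```

If this fits the data, it could replace the chained `str.replace` calls for `Endowment` via `df['Endowment'].map(parse_endowment_billions)`.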

## 2. Basic Statistical Analysis

Let's examine the basic statistics of our numerical variables:

In [None]:
# Display summary statistics
# (print explicitly: only the last expression in a cell is displayed automatically)
print("Summary Statistics:")
print(df.describe())

# Display correlation matrix
plt.figure(figsize=(12, 10))
numeric_columns = ['Age', 'Academic Staff', 'Number of Students', 
                  'Minimum Tuition cost', 'Volumes in the library', 
                  'Endowment', 'Student_Staff_Ratio']
correlation_matrix = df[numeric_columns].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Heatmap of University Metrics')
plt.tight_layout()
plt.show()
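
`DataFrame.corr()` computes Pearson coefficients by default, and the heatmap shows no significance information. A minimal sketch, assuming a p-value and a rank-based (Spearman) coefficient are also wanted for a given pair of columns, could use `scipy.stats`, which the notebook imports but does not otherwise use. The frame `demo` and its columns `a` and `b` are synthetic placeholders, not values from the dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for two numeric columns (not real university data)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
demo = pd.DataFrame({"a": x, "b": 0.6 * x + rng.normal(scale=0.8, size=200)})

# Both tests need complete pairs, so drop rows with missing values first
pair = demo[["a", "b"]].dropna()

r, p_pearson = stats.pearsonr(pair["a"], pair["b"])
rho, p_spearman = stats.spearmanr(pair["a"], pair["b"])

print(f"Pearson r    = {r:.2f} (p = {p_pearson:.3g})")
print(f"Spearman rho = {rho:.2f} (p = {p_spearman:.3g})")
```

Spearman's coefficient is less sensitive to a few extreme values, which may matter for heavily skewed columns such as endowment.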

## 3. Historical Analysis

Let's analyze how university characteristics vary across different time periods:

In [None]:
# Create time period analysis
establishment_periods = pd.cut(df['Established'], 
                             bins=[1600, 1700, 1800, 1850, 1900, 1950, 2000],
                             labels=['1600s', '1700s', '1800-1850', '1850-1900', '1900-1950', '1950-2000'])

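# observed=False keeps empty periods in the table; passing it explicitly avoids
# the FutureWarning recent pandas versions raise when grouping by a categorical
period_stats = df.groupby(establishment_periods, observed=False).agg({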
    'Endowment': 'mean',
    'Minimum Tuition cost': 'mean',
    'Number of Students': 'mean'
}).round(2)

# Plot the results
period_stats.plot(kind='bar', subplots=True, layout=(3,1), figsize=(15, 12))
plt.tight_layout()
plt.show()

print("\nAverage metrics by establishment period:")
print(period_stats)
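
It may help to see how `pd.cut` treats the bin edges used above. By default each interval is closed on the right, so a year equal to an edge falls into the earlier period, and any year outside the outermost edges becomes `NaN`. The sketch below uses the same bins and labels as the cell above; the years in `years` are arbitrary illustrative values.

```python
import pandas as pd

# Arbitrary years chosen to hit bin edges and the out-of-range case
years = pd.Series([1636, 1700, 1701, 1850, 2000, 2005])

bins = [1600, 1700, 1800, 1850, 1900, 1950, 2000]
labels = ['1600s', '1700s', '1800-1850', '1850-1900', '1900-1950', '1950-2000']

# Default right=True: intervals are (1600, 1700], (1700, 1800], ...
periods = pd.cut(years, bins=bins, labels=labels)
print(pd.DataFrame({"year": years, "period": periods}))

# Years after 2000 (or up to and including 1600) are left unassigned
n_unbinned = periods.isna().sum()
print(f"Unassigned rows: {n_unbinned}")
```

Rows with an unassigned period are dropped by `groupby`, so institutions founded after 2000 would not appear in `period_stats`.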

## 4. Size-Based Analysis

Let's analyze how various metrics vary with university size:

In [None]:
# Create size categories
df['Size_Category'] = pd.qcut(df['Number of Students'], q=4, 
                             labels=['Small', 'Medium', 'Large', 'Very Large'])

# Create subplots for different metrics
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Plot different metrics by size
sns.boxplot(data=df, x='Size_Category', y='Endowment', ax=axes[0,0])
axes[0,0].set_title('Endowment by University Size')
axes[0,0].tick_params(axis='x', rotation=45)

sns.boxplot(data=df, x='Size_Category', y='Minimum Tuition cost', ax=axes[0,1])
axes[0,1].set_title('Tuition Cost by University Size')
axes[0,1].tick_params(axis='x', rotation=45)

sns.boxplot(data=df, x='Size_Category', y='Volumes in the library', ax=axes[1,0])
axes[1,0].set_title('Library Resources by University Size')
axes[1,0].tick_params(axis='x', rotation=45)

sns.boxplot(data=df, x='Size_Category', y='Student_Staff_Ratio', ax=axes[1,1])
axes[1,1].set_title('Student-Staff Ratio by University Size')
axes[1,1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

# Print size category statistics
print("\nAverage metrics by university size:")
print(df.groupby('Size_Category', observed=True).agg({  # explicit observed= avoids a pandas FutureWarning
    'Number of Students': 'mean',
    'Endowment': 'mean',
    'Minimum Tuition cost': 'mean',
    'Student_Staff_Ratio': 'mean'
}).round(2))
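
The size categories come from `pd.qcut`, which splits at quantiles rather than at evenly spaced values. As a rough illustration, assuming distinct enrollment figures, each category receives the same number of universities while the enrollment ranges they cover can differ widely. The series `students` below is synthetic and deliberately skewed; it is not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed enrollment figures (40 distinct values)
students = pd.Series([1000 * 1.15 ** i for i in range(40)])

labels = ['Small', 'Medium', 'Large', 'Very Large']
# retbins=True also returns the quantile edges that define the categories
size_category, edges = pd.qcut(students, q=4, labels=labels, retbins=True)

counts = size_category.value_counts().sort_index()
print(counts)            # equal number of rows per category
print(np.round(edges))   # widths of the categories grow with size
```

Printing the edges for the real `Number of Students` column would show what "Small" through "Very Large" mean in absolute terms.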

## 5. Country Comparison

Let's analyze the differences between US and Canadian universities:

In [None]:
# Create country comparison visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

sns.violinplot(data=df, x='Country', y='Age', ax=axes[0,0])
axes[0,0].set_title('Age Distribution by Country')

sns.violinplot(data=df, x='Country', y='Student_Staff_Ratio', ax=axes[0,1])
axes[0,1].set_title('Student-Staff Ratio by Country')

sns.violinplot(data=df, x='Country', y='Endowment', ax=axes[1,0])
axes[1,0].set_title('Endowment Distribution by Country')

sns.violinplot(data=df, x='Country', y='Minimum Tuition cost', ax=axes[1,1])
axes[1,1].set_title('Tuition Cost Distribution by Country')

plt.tight_layout()
plt.show()

# Print country comparison statistics
print("\nDetailed country comparison:")
print(df.groupby('Country').agg({
    'Age': ['mean', 'min', 'max'],
    'Endowment': ['mean', 'min', 'max'],
    'Minimum Tuition cost': ['mean', 'min', 'max'],
    'Student_Staff_Ratio': ['mean', 'min', 'max']
}).round(2))
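
The violin plots and group means describe the two countries but do not say whether a difference could be due to chance. One possible check, sketched here under the assumption that a nonparametric comparison is appropriate for skewed metrics such as endowment, is a Mann-Whitney U test from `scipy.stats`. The frame `demo`, the country labels `usa` and `canada`, and the generated values are all placeholders, not data from the notebook.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in data: two groups with skewed (lognormal) values
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Country": ["usa"] * 60 + ["canada"] * 20,
    "Endowment": np.concatenate([
        rng.lognormal(mean=0.5, sigma=1.0, size=60),
        rng.lognormal(mean=-0.5, sigma=1.0, size=20),
    ]),
})

# Split the metric by country, ignoring missing values
usa = demo.loc[demo["Country"] == "usa", "Endowment"].dropna()
canada = demo.loc[demo["Country"] == "canada", "Endowment"].dropna()

# Two-sided test of whether one group tends to have larger values
u_stat, p_value = stats.mannwhitneyu(usa, canada, alternative="two-sided")
print(f"U = {u_stat:.1f}, p = {p_value:.4f}")
```

The same pattern applies to `Age`, `Minimum Tuition cost`, or `Student_Staff_Ratio` by changing the column name.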

## 6. Top Universities Analysis

Let's look at the leading universities in different categories:

In [None]:
print("Top 5 Universities by Endowment:")
print(df.nlargest(5, 'Endowment')[['Name', 'Endowment', 'Country']])

print("\nTop 5 Universities by Student Population:")
print(df.nlargest(5, 'Number of Students')[['Name', 'Number of Students', 'Country']])

print("\nTop 5 Universities by Library Resources:")
print(df.nlargest(5, 'Volumes in the library')[['Name', 'Volumes in the library', 'Country']])

print("\nOldest Universities:")
print(df.nsmallest(5, 'Established')[['Name', 'Established', 'Country']])

## 7. Key Findings and Conclusions

1. **Historical Patterns**:
   - Older universities (1600s-1700s) tend to have larger endowments and higher tuition costs
   - More recently founded universities have larger student populations but smaller endowments

2. **Country Differences**:
   - US universities have higher average endowments ($3.85B vs $0.72B)
   - Canadian universities have lower tuition costs but larger student populations
   - US universities show more variation in resources and size

3. **Size and Resources**:
   - Larger universities don't necessarily have larger endowments
   - Student-staff ratios increase with university size
   - Library resources don't scale linearly with size

4. **Notable Correlations**:
   - Strong positive correlation between endowment and library volumes (0.67)
   - Moderate correlation between age and endowment (0.48)
   - Negligible correlation between student population and endowment (-0.05)

5. **Leading Institutions**:
   - Harvard leads in both endowment ($50.7B) and library resources (14.4M volumes)
   - Grand Canyon University has the largest student population (101,816)
   - The oldest institution is Harvard University (established 1636)