In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA
from sklearn import preprocessing

dataset = pd.read_csv('uaScoresDataFrame.csv')

print('First few rows of the dataset:')
print(dataset.head())
print("\nData types of each column:")
print(dataset.dtypes)
print("\nSummary statistics:")
print(dataset.describe())
print("\nMissing values per column:")
print(dataset.isnull().sum())

First few rows of the dataset:
   Unnamed: 0      UA_Name    UA_Country   UA_Continent  Housing  \
0           0       Aarhus       Denmark         Europe   6.1315   
1           1     Adelaide     Australia        Oceania   6.3095   
2           2  Albuquerque    New Mexico  North America   7.2620   
3           3       Almaty    Kazakhstan           Asia   9.2820   
4           4    Amsterdam   Netherlands         Europe   3.0530   

   Cost of Living  Startups  Venture Capital  Travel Connectivity  Commute  \
0           4.015    2.8270            2.512               3.5360  6.31175   
1           4.692    3.1365            2.640               1.7765  5.33625   
2           6.059    3.7720            1.493               1.4555  5.05575   
3           9.333    2.4585            0.000               4.5920  5.87125   
4           3.824    7.9715            6.107               8.3245  6.11850   

   ...  Safety  Healthcare  Education  Environmental Quality  Economy  \
0  ...  9.6165    

In [1]:
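# Requires `dataset` from the loading cell above; running this cell in a fresh kernel raises the NameError shown below.
# Drop the index and identifier columns so that only the numeric indicators remain
dataset_num = dataset.drop(columns=['Unnamed: 0', 'UA_Name', 'UA_Country', 'UA_Continent'])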
print("\nColumns used for analysis:")
print(dataset_num.columns)  # Expecting 17 numeric columns

# Min-max normalize each indicator to the [0, 1] range
scaler = MinMaxScaler()
scaled_values = scaler.fit_transform(dataset_num)
dataset_minmax = pd.DataFrame(scaled_values, columns=dataset_num.columns)

# Verify the normalized data
print("Summary statistics of normalized data:")
print(dataset_minmax.describe())
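
# Transpose so that each row is one indicator and each column is one city (17 rows, 266 columns).
# preprocessing.scale standardizes along axis 0, so each city's column is centered and scaled
# across the 17 indicators. Note that dataset_minmax above is not used for the PCA.
scaled_data = preprocessing.scale(dataset_num.T)
print("Shape after transposing and scaling:", scaled_data.shape)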

# Perform PCA on the transposed data (analyzing the structure among the 17 variables)
pca = PCA()
pca_fit = pca.fit(scaled_data)

# Explained variance ratio and cumulative explained variance plot
explained_variance = pca_fit.explained_variance_ratio_
print("\nExplained Variance Ratio (per component):")
print(explained_variance)

cumulative_variance = np.cumsum(explained_variance)
plt.figure(figsize=(8, 5))
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o', linestyle='--')
plt.xlabel("Principal Component")
plt.ylabel("Cumulative Explained Variance")
plt.title("Cumulative Explained Variance")  # a scree plot would show the per-component values instead
plt.grid(True)
plt.show()

# pca_fit.components_ holds the loadings and has shape (17, 266): one row per component, one column per city.
# Here we instead project the 17 indicators (rows) into PCA space; the result is their scores, not loadings.
variable_scores = pca_fit.transform(scaled_data)  # shape should be (17, 17)

# Label rows with the original indicator names and columns as PC1, PC2, ...
# (the variable keeps the name pca_loadings even though it stores scores)
pca_loadings = pd.DataFrame(
    variable_scores,
    index=dataset_num.columns,  # 17 original numeric indicator names
    columns=[f"PC{i+1}" for i in range(variable_scores.shape[1])]
)
print("\nPCA scores of each indicator (rows) on each component:")
print(pca_loadings)


NameError: name 'dataset' is not defined

In [None]:


1. **Dataset Overview**:
    - The dataset contains information about various cities, including their scores on different indices such as Housing, Cost of Living, Startups, Venture Capital, Travel Connectivity, Commute, Business Freedom, Safety, Healthcare, Education, Environmental Quality, Economy, Taxation, Internet Access, Leisure & Culture, Tolerance, and Outdoors.
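    - The dataset has 266 rows and 21 columns: an index column (`Unnamed: 0`), three identifier columns (`UA_Name`, `UA_Country`, `UA_Continent`), and 17 numeric indicators.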

2. **Correlation Matrix**:
    - The correlation matrix (`corr_matrix`) holds the pairwise correlation coefficients between indices such as Stability, Healthcare, Culture & Environment, Education, and Infrastructure.
    - Each value lies between -1 and 1: the sign gives the direction of the relationship and the magnitude gives its strength.

3. **Principal Component Analysis (PCA)**:
    - PCA was run on the transposed, standardized data, and both the explained variance ratio and its cumulative sum are reported.
    - The explained variance ratio is the fraction of the total variance captured by each principal component.
    - The cumulative variance is the running total of those fractions, that is, the share of variance captured by the first k components together.

4. **Indices and Weights**:
    - The dataset includes various indices such as Basic_Needs_Index, Economic_Opportunity_Index, Mobility_Infrastructure_Index, Quality_of_Life_Index, and Composite_Index.
    - Weights for different indices are provided, indicating the importance of each factor in calculating the respective indices.

5. **Missing Values**:
    - The dataset appears to have no missing values, as indicated by the `isnull().sum()` output showing zero missing values for each column.

6. **Data Types**:
    - The dataset contains columns with different data types, including float64 for numerical values and object for categorical values.

7. **Summary Statistics**:
    - Summary statistics such as mean, standard deviation, minimum, and maximum values are provided for each column in the dataset.
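
The correlation matrix described in item 2 can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's actual `corr_matrix`: the `demo` frame, its random values, and the choice of four columns are assumptions made only to keep the example self-contained.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a few numeric indicator columns (values are made up).
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.uniform(0, 10, size=(50, 4)),
    columns=["Housing", "Cost of Living", "Safety", "Healthcare"],
)

# Pearson correlation between every pair of columns.
corr_matrix = demo.corr()

# The matrix is symmetric, has ones on the diagonal, and stays within [-1, 1].
print(corr_matrix.round(2))
```

With the real data, the same call on `dataset_num` would presumably give a 17 x 17 matrix, one row and column per indicator.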

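The explained and cumulative variance from item 3 can be illustrated with a small sketch. It uses synthetic data with observations as rows (the conventional orientation, unlike the transposed input used in the cell above), and the 90% threshold is an arbitrary choice for illustration rather than a value taken from the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 100 observations, 5 features (values are made up).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=100)  # make two features correlated

# Standardize each feature, then fit PCA with all components kept.
X_std = StandardScaler().fit_transform(X)
pca = PCA().fit(X_std)

explained = pca.explained_variance_ratio_  # fraction of variance per component
cumulative = np.cumsum(explained)          # running total across components

# Smallest number of components whose cumulative variance reaches 90%
# (the threshold is an assumption chosen for this example).
n_keep = int(np.searchsorted(cumulative, 0.90) + 1)

print("Explained variance ratio:", explained.round(3))
print("Cumulative variance:     ", cumulative.round(3))
print("Components needed for 90%:", n_keep)
```
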
These observations provide a comprehensive overview of the dataset and its characteristics, which can be useful for further analysis and modeling.
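
One common way to turn weights into an index, as mentioned in item 4, is a weighted sum of the component scores. The sketch below assumes that approach; the cities, scores, weights, and the name `demo_index` are all hypothetical and are not taken from the notebook's `Basic_Needs_Index` or `Composite_Index` definitions.

```python
import pandas as pd

# Hypothetical scores for two cities on three indicators.
scores = pd.DataFrame(
    {
        "Housing": [6.0, 3.0],
        "Cost of Living": [4.0, 8.0],
        "Healthcare": [8.0, 7.0],
    },
    index=["City A", "City B"],
)

# Hypothetical weights; normalized so that they sum to 1.
weights = pd.Series({"Housing": 0.5, "Cost of Living": 0.3, "Healthcare": 0.2})
weights = weights / weights.sum()

# Weighted sum per city: each score is multiplied by its weight, then summed.
demo_index = scores[weights.index] @ weights

print(demo_index)
```

Normalizing the weights keeps the index on the same scale as the input scores, which makes indices built from different numbers of indicators easier to compare.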