<a href="https://colab.research.google.com/github/cedamusk/AI-N-ML/blob/main/Linear_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Linear Regression

**Link to the dataset used:** https://drive.google.com/file/d/1juE760yG7wkcDXIQ38AKVhOGPlH2oeej/view?usp=sharing

## Import Libraries and Modules
1.  `import pandas as pd`: Imports the `pandas` library and assigns it the alias `pd`. `pandas` is used for data manipulation and analysis. Commonly, it works with data structures like DataFrames and Series.

2.  `import numpy as np`: Imports the `numpy` library and assigns it the alias `np`. `numpy` is used for numerical computation in Python, particularly for working with arrays and performing mathematical operations.

3.  `from sklearn.linear_model import LinearRegression`: Imports the `LinearRegression` class from `sklearn.linear_model`, part of the `scikit-learn` library. This class fits an ordinary least squares linear model to the data; with a single feature, that is a straight line.

4.  `from sklearn.metrics import mean_absolute_error, mean_squared_error`: Imports two error metrics from the `sklearn.metrics` module. `mean_absolute_error` calculates the average of the absolute errors between predicted and actual values. `mean_squared_error` calculates the average of the squared errors, which penalizes larger errors more heavily.

5.  `import matplotlib.pyplot as plt`: Imports the `pyplot` module from the `matplotlib` library and assigns it the alias `plt`. It is used for creating visualizations such as line plots, scatter plots, and histograms.

6.  `from scipy import stats`: Imports the `stats` module from the `scipy` library. It provides statistical functions, such as computing correlation coefficients or performing hypothesis tests.





In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
import matplotlib.pyplot as plt
from scipy import stats

In [None]:
data=pd.read_csv('/content/synthetic_renewable_energy_analysis.csv')

In [None]:
print("First few rows:")
print(data.head())
print("\nColumn names:")
print(data.columns)
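
If the CSV file is not at hand, a small stand-in DataFrame with the three column names used later in this notebook may be enough to try the steps. This is only a sketch: the values are randomly generated placeholders, not the real dataset, and the helper name `make_demo_data` is hypothetical.

```python
import numpy as np
import pandas as pd


def make_demo_data(n_per_country=10, seed=0):
    """Build a placeholder DataFrame; the values are random, not real data."""
    rng = np.random.default_rng(seed)
    frames = []
    for country in ["Kenya", "Ireland"]:
        share = rng.uniform(10, 80, size=n_per_country)
        # Arbitrary linear trend plus noise, purely for illustration.
        growth = 2.0 + 0.03 * share + rng.normal(0, 0.5, size=n_per_country)
        frames.append(pd.DataFrame({
            "Country": country,
            "Renewable_Energy_Share (%)": share,
            "GDP_Growth_Rate (%)": growth,
        }))
    return pd.concat(frames, ignore_index=True)


demo = make_demo_data()
print(demo.head())
print(demo.columns.tolist())
```

Passing `demo` in place of `data` to the functions defined below should exercise the same code paths, though the resulting numbers carry no meaning.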

## Analyze Renewable energy and GDP correlation
This function analyzes the relationship between renewable energy share and GDP growth rate for each country in the dataset. It fits a linear regression per country and computes several statistical metrics.

## Code Breakdown
1.  `results={}`: Initializes an empty dictionary to store the analysis results for each country.

2. `for country in data['Country'].unique():`: Loops through each unique country in the `Country` column of the dataset. It analyzes data for each country separately.

3. `country_data=data[data['Country']==country]`: Filters the dataset to include only rows corresponding to the current `country`.

4. `X=country_data['Renewable_Energy_Share (%)'].values.reshape(-1,1)`: Extracts the independent variable (renewable energy share) as a NumPy array. The `.reshape(-1, 1)` ensures that `X` has the correct 2D shape required for the regression model.

5. `y=country_data['GDP_Growth_Rate (%)'].values`: Extracts the dependent variable (GDP Growth) as a NumPy array.

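6. `model=LinearRegression()`: Creates a linear regression model instance from `sklearn`. The parentheses are required; without them, `model` would refer to the class itself rather than an instance.

7. `model.fit(X, y)`: Trains the linear regression model using `X` (independent variable) and `y` (dependent variable).

8. `y_pred=model.predict(X)`: Predicts the GDP growth rate (`y_pred`) from the renewable energy share (`X`) using the trained model. Because the predictions are made on the same data used for fitting, the metrics that follow describe in-sample fit.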
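
The reshape and fit steps above can be tried in isolation. The sketch below uses made-up values where `y` is exactly `2*x + 1`, so the fitted slope and intercept should come out close to 2 and 1; none of these numbers come from the dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative values only: y is exactly 2*x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

print(x.shape)        # (4,)   one-dimensional
X = x.reshape(-1, 1)
print(X.shape)        # (4, 1) four samples, one feature

model = LinearRegression()  # the parentheses create an instance
model.fit(X, y)

slope = model.coef_[0]
intercept = model.intercept_
prediction = model.predict(np.array([[5.0]]))[0]
print(f"slope={slope:.4f}, intercept={intercept:.4f}, prediction at x=5: {prediction:.4f}")
```

`fit` and `predict` expect a 2D feature array of shape `(n_samples, n_features)`, which is why the 1D column is reshaped first.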

## Performance Metrics
9. `mae=mean_absolute_error(y, y_pred)`: Calculates the Mean Absolute Error (MAE), which measures the average absolute difference between predicted and actual values.

10. `mse=mean_squared_error(y, y_pred)`: Calculates the Mean Squared Error (MSE), which measures the average squared difference between predicted and actual values.

11. `rmse=np.sqrt(mse)`: Calculates the Root Mean Squared Error (RMSE), which is the square root of MSE and has the same units as the dependent variable.

12. `r_squared=model.score(X, y)`: Computes the R-Squared value, which indicates how well the model explains the variance in the data.

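13. `adjusted_r_squared=1-(1-r_squared)*(len(y)-1) / (len(y)-X.shape[1]-1)`: Adjusts the R-squared value for the number of predictors (`X.shape[1]`, here 1) relative to the number of observations (`len(y)`). This penalizes predictors that do not improve the fit, which matters most in models with multiple variables. The denominator is zero when a country has only two observations, so each country needs at least three rows.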
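
To make the formulas behind these metrics concrete, here is a small hand computation with NumPy only. The four values of `y` and `y_pred` are invented for the arithmetic and are not taken from the dataset.

```python
import numpy as np

# Invented values, chosen so the arithmetic is easy to follow.
y = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])
n_predictors = 1

residuals = y - y_pred                   # [1, 0, -1, -2]
mae = np.mean(np.abs(residuals))         # (1 + 0 + 1 + 2) / 4
mse = np.mean(residuals ** 2)            # (1 + 0 + 1 + 4) / 4
rmse = np.sqrt(mse)                      # square root of the MSE itself

ss_res = np.sum(residuals ** 2)          # residual sum of squares: 6
ss_tot = np.sum((y - y.mean()) ** 2)     # total sum of squares: 20
r_squared = 1 - ss_res / ss_tot

n = len(y)
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - n_predictors - 1)

print(f"MAE={mae:.4f} MSE={mse:.4f} RMSE={rmse:.4f}")
print(f"R2={r_squared:.4f} adjusted R2={adjusted_r_squared:.4f}")
```

With these numbers the MAE is 1.0, the MSE is 1.5, R-squared is 0.70, and adjusted R-squared is 0.55. The adjustment is large here because there are only four observations.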

## Correlation Analysis
14. `correlation_coef, p_value=stats.pearsonr(...)`: Computes the Pearson correlation coefficient and its p-value for renewable energy share and GDP growth rate. `correlation_coef` measures the strength and direction of the linear relationship, from -1 to 1. `p_value` is the probability of seeing a correlation at least this strong if the two variables were actually uncorrelated, so a small value suggests the relationship is unlikely to be due to chance alone.
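
With a single predictor, the square of the Pearson coefficient equals the R-squared of the fitted line, which gives a quick consistency check between the two sections. The sketch below illustrates this on made-up numbers, not on the dataset.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

# Made-up, roughly linear values for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

r, p_value = stats.pearsonr(x, y)

X = x.reshape(-1, 1)
r_squared = LinearRegression().fit(X, y).score(X, y)

# For simple (one-feature) linear regression, r**2 matches R-squared.
print(f"r={r:.4f}  r^2={r**2:.4f}  R^2={r_squared:.4f}  p={p_value:.2e}")
```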

## Store results
15. `results[country]={...}`: Stores the calculated metrics and model parameters for each country. `slope`: coefficient of the independent variable. `intercept`: intercept of the regression line. `mae`, `mse`, `rmse`, `r_squared`, `adjusted_r_squared`: performance metrics. `correlation_coef`, `p_value`: correlation analysis results. `data`: the arrays `X`, `y`, and `y_pred`, kept for later use.

`return results`: Returns the dictionary `results` containing analysis for all countries.
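
Because `results` is a dictionary of dictionaries keyed by country, one possible way to compare countries side by side is to load the scalar metrics into a DataFrame. This is a sketch, not part of the notebook's code: the helper name `summarize_results` is hypothetical and the numbers in `example_results` are placeholders in the shape described above.

```python
import pandas as pd

# Placeholder numbers in the shape described above; not real results.
example_results = {
    "Kenya": {"slope": 0.03, "intercept": 2.0, "r_squared": 0.40,
              "data": {"X": [], "y": [], "y_pred": []}},
    "Ireland": {"slope": -0.01, "intercept": 3.5, "r_squared": 0.05,
                "data": {"X": [], "y": [], "y_pred": []}},
}


def summarize_results(results):
    """Return one row per country, leaving out the nested 'data' arrays."""
    rows = {
        country: {key: value for key, value in metrics.items() if key != "data"}
        for country, metrics in results.items()
    }
    return pd.DataFrame.from_dict(rows, orient="index")


summary = summarize_results(example_results)
print(summary)
```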



In [None]:
def analyze_renewable_gdp_correlation(data):
  results={}

  for country in data['Country'].unique():
    country_data=data[data['Country']==country]

    X=country_data['Renewable_Energy_Share (%)'].values.reshape(-1,1)
    y=country_data['GDP_Growth_Rate (%)'].values

    model=LinearRegression()
    model.fit(X, y)

    y_pred=model.predict(X)

    mae=mean_absolute_error(y, y_pred)
    mse=mean_squared_error(y, y_pred)
    rmse=np.sqrt(mse)  # RMSE is the square root of the unmodified MSE
    r_squared=model.score(X, y)
    adjusted_r_squared=1-(1-r_squared)*(len(y)-1)/(len(y)-X.shape[1]-1)

    correlation_coef, p_value=stats.pearsonr(
        country_data['Renewable_Energy_Share (%)'],
        country_data['GDP_Growth_Rate (%)']
    )

    results[country]={
        'slope':model.coef_[0],
        'intercept': model.intercept_,
        'mae':mae,
        'mse':mse,
        'rmse':rmse,
        'r_squared': r_squared,
        'adjusted_r_squared': adjusted_r_squared,
        'correlation_coef': correlation_coef,
        'p_value': p_value,
        'data':{
            'X':X,
            'y':y,
            'y_pred': y_pred
        }

    }

  return results

## Plot Comparison
The function visualizes the relationship between renewable energy share and GDP growth rate for multiple countries (Kenya and Ireland). It plots the predicted GDP growth rates against renewable energy share for comparison.

## Code breakdown
1. `plt.figure(figsize=(12, 6))`: Creates a new figure for plotting with a size of 12 inches by 6 inches. Ensures the plot has sufficient space and is easy to read.

2. `countries=list(results.keys())`: Extracts the list of country names (keys) from the `results` dictionary.

3. `colors=['blue', 'green']`: Defines a list of colours for plotting different countries. Because `zip` stops at the shorter sequence, any country beyond the second would be silently skipped, so extend this list if there are more than two countries.

4. `for i, (country, color) in enumerate(zip(countries, colors)):`: Loops through each country and its corresponding color using `zip`. `i` is the index of the loop (unused here).

5. `country_results=results[country]`: Retrieves the results for the current `country` from the `results` dictionary.

6. `plt.scatter(...)`: Draws a scatter plot of the predicted data. `country_results['data']['X']`: renewable energy share (`X`) for the country. `country_results['data']['y_pred']`: predicted GDP growth rates (`y_pred`) for the country. `color=color`: assigns a unique color to each country. `linestyle='--'`: does not connect the points with a dashed line in a scatter plot, so it can be removed. `label=f'{country}(Predicted)'`: adds a legend label with the country name and "Predicted".

7. `plt.xlabel('Renewable Energy Share (%)')`: Sets the label for the x-axis.

8. `plt.ylabel('GDP Growth Rate (%)')`: Sets the label for the y-axis.

9. `plt.title(...)`: Sets the title of the plot "Relationship between Renewable Energy Share and GDP Growth".

10. `plt.legend()`: Displays a legend indicating which country corresponds to which color in the plot.

11. `plt.grid(True, alpha=0.3)`: Adds a light grid to the plot for better readability. `alpha=0.3` makes the gridlines semi-transparent.

12. `return plt`: Returns the `matplotlib.pyplot` module (not a figure object), so the caller can display the current figure with `plt.show()` or save it with `plt.savefig(...)`.
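
Two points from the breakdown above, the fixed two-colour list and the dashed line style, can be handled differently. The sketch below is one possible variation rather than the notebook's method: it takes colours from matplotlib's `tab10` colormap so that more than two countries are covered, and it uses `plot` instead of `scatter` so the dashed style actually draws a line. The three countries and their values are placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs as a script; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Placeholder predictions for three hypothetical countries (not real data).
example = {
    "Country A": (np.array([30.0, 10.0, 20.0]), np.array([2.9, 2.3, 2.6])),
    "Country B": (np.array([15.0, 45.0, 30.0]), np.array([3.2, 2.6, 2.9])),
    "Country C": (np.array([50.0, 60.0, 40.0]), np.array([1.5, 1.7, 1.3])),
}

cmap = plt.get_cmap("tab10")  # qualitative colormap with ten distinct colours
fig, ax = plt.subplots(figsize=(12, 6))

for i, (country, (x, y_pred)) in enumerate(example.items()):
    order = np.argsort(x)  # sort by x so the line is drawn left to right
    ax.plot(x[order], y_pred[order], color=cmap(i % 10), linestyle="--",
            label=f"{country} (Predicted)")

ax.set_xlabel("Renewable Energy Share (%)")
ax.set_ylabel("GDP Growth Rate (%)")
ax.legend()
ax.grid(True, alpha=0.3)
```

Here the loop index `i` is put to use for picking the colour, so no country is dropped when the list of countries grows.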

In [None]:
def plot_comparison(data, results):
  plt.figure(figsize=(12, 6))

  countries=list(results.keys())
  colors=['blue', 'green']

  for i, (country, color) in enumerate(zip(countries, colors)):
    country_results=results[country]

    plt.scatter(
        country_results['data']['X'],
        country_results['data']['y_pred'],
        color=color,
        linestyle='--',
        label=f'{country}(Predicted)'
    )

  plt.xlabel('Renewable Energy Share (%)')
  plt.ylabel('GDP Growth Rate (%)')
  plt.title('Relationship between Renewable Energy Share and GDP Growth')
  plt.legend()
  plt.grid(True, alpha=0.3)

  return plt

## Print Detailed Metrics
The function iterates through the `results` dictionary and prints detailed metrics for each country. It organizes the metrics into categories (model parameters, error metrics, goodness of fit, and correlation analysis) for readability.

## Code Breakdown
1. `for country, result in results.items():`: Iterates through the `results` dictionary, where `country` is the key (name of the country). `result` is the corresponding dictionary containing the metrics and model data.

2. `print(f"\nDetailed metrics for {country}:")`: Prints the name of the country, prefixed by a newline (`\n`) for better readability.

3. `print("-"*40)`: Prints a horizontal separator line (40 dashes) for better visual organization.

4. `print(f"Model Parameters:")`: Indicates the start of the model parameter section.

5. `print(f"Slope: {result['slope']:.4f}")`: Prints the slope (coefficient of the independent variable) rounded to 4 decimal places. `result['slope']`: the change in GDP growth rate, in percentage points, per one-point increase in renewable energy share.

6. `print(f"Intercept: {result['intercept']:.4f}")`: Prints the intercept of the regression line, rounded to 4 decimal places. `result['intercept']`: the predicted GDP growth rate when the renewable energy share is zero.

## Error metrics
7. `print("\nError Metrics:")`: Introduces the error metrics section. The next three lines print `MAE` (average absolute error in predictions), `MSE` (average squared error in predictions), and `RMSE` (square root of the MSE, a typical error size in the same units as GDP growth rate). All values are rounded to 4 decimal places.

## Goodness of fit
8. `print("\nGoodness of fit:")`: Introduces the section for R-squared and adjusted R-squared metrics.

9. Prints `R-Squared` and `Adjusted R-Squared`. `R-Squared`: proportion of variance in GDP growth rate explained by the renewable energy share. `Adjusted R-Squared`: R-squared adjusted for the number of predictors to avoid overestimation.

## Correlation analysis
10. `print("\nCorrelation Analysis:")`: Introduces the correlation analysis section.

11. Prints the correlation coefficient and p-value. `Correlation coefficient`: indicates the strength and direction of the linear relationship. `P-Value`: assesses the statistical significance of the relationship.
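
The `:.4f` format specification used throughout rounds to four decimal places, which can hide very small p-values by printing them as `0.0000`. The short sketch below, with arbitrary example numbers, shows that effect and scientific notation (`:.2e`) as one possible alternative.

```python
value = 3.14159265
tiny_p = 0.0000123  # arbitrary small number standing in for a p-value

fixed_value = f"{value:.4f}"
fixed_p = f"{tiny_p:.4f}"
scientific_p = f"{tiny_p:.2e}"
separator = "-" * 10

print(fixed_value)    # 3.1416
print(fixed_p)        # 0.0000
print(scientific_p)   # 1.23e-05
print(separator)      # ----------
```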

In [None]:
def print_detailed_metrics(results):
  for country, result in results.items():
    print(f"\nDetailed metrics for {country}:")
    print("-"* 40)
    print(f"Model Parameters:")
    print(f"Slope: {result['slope']:.4f}")
    print(f"Intercept: {result['intercept']:.4f}")
    print("\nError Metrics:")
    print(f"Mean Absolute Error (MAE): {result['mae']:.4f}")
    print(f"Mean Squared Error (MSE): {result['mse']:.4f}")
    print(f"Root Mean Squared Error (RMSE): {result['rmse']:.4f}")
    print("\nGoodness of fit:")
    print(f"R-Squared: {result['r_squared']:.4f}")
    print(f"Adjusted R-Squared: {result['adjusted_r_squared']:.4f}")
    print("\nCorrelation Analysis:")
    print(f"Correlation coefficient: {result['correlation_coef']:.4f}")
    print(f"P-value: {result['p_value']:.4f}")

In [None]:
data=pd.read_csv('/content/synthetic_renewable_energy_analysis.csv')
results=analyze_renewable_gdp_correlation(data)
print_detailed_metrics(results)

plot=plot_comparison(data, results)
plt.show()
