# Pandas: End-to-End Project

In the last four projects you learned to:

- Load and manipulate data
- Group and summarize data
- Integrate data
- Visualize data

In this project you will work through the entire data lifecycle, from loading and exploring data to cleansing it, wrangling it, integrating it, visualizing it, and finally modeling it.

## Introduction to the Dataset

Our data was gathered at two solar power plants in India over a 34-day period. It comes as two pairs of files; each pair has one power generation dataset and one sensor readings dataset. The power generation datasets are gathered at the inverter level, where each inverter has multiple lines of solar panels attached to it. The sensor data is gathered at the plant level, from a single array of sensors optimally placed at the plant.

Here's a [link](https://github.com/anikannal/Pandas_Data) to the dataset.

There are a few areas of concern at the solar power plant -

- Can we predict the power generation for the next couple of days? This allows for better grid management.
- Can we identify the need for panel cleaning/maintenance?
- Can we identify faulty or suboptimally performing equipment?

### Power Generation Data
Solar power generation data for one plant, gathered at 15-minute intervals over a 34-day period.

Files - 
- [Plant_1_Generation_Data.csv](https://raw.githubusercontent.com/anikannal/Pandas_Data/master/Power%20Data/Plant_1_Generation_Data.csv)
- [Plant_2_Generation_Data.csv](https://raw.githubusercontent.com/anikannal/Pandas_Data/master/Power%20Data/Plant_2_Generation_Data.csv)

**Column Descriptions -** 

- DATE_TIME - Date and time for each observation. Observations are recorded at 15-minute intervals.
- PLANT_ID - The plant id. It is the same for the entire file.
- SOURCE_KEY - The inverter id.
- DC_POWER - Amount of DC power generated by the inverter (SOURCE_KEY) in this 15-minute interval. Units - kW.
- AC_POWER - Amount of AC power generated by the inverter (SOURCE_KEY) in this 15-minute interval. Units - kW.
- DAILY_YIELD - Cumulative sum of power generated on that day, up to that point in time.
- TOTAL_YIELD - Total yield for the inverter up to that point in time.

### Sensor Data
Weather sensor data for one solar plant, gathered every 15 minutes over a 34-day period.

Files - 
- [Plant_1_Weather_Sensor_Data.csv](https://raw.githubusercontent.com/anikannal/Pandas_Data/master/Power%20Data/Plant_1_Weather_Sensor_Data.csv)
- [Plant_2_Weather_Sensor_Data.csv](https://raw.githubusercontent.com/anikannal/Pandas_Data/master/Power%20Data/Plant_2_Weather_Sensor_Data.csv)

**Column Descriptions -** 

- DATE_TIME - Date and time for each observation. Observations are recorded at 15-minute intervals.
- PLANT_ID - The plant id. It is the same for the entire file.
- SOURCE_KEY - The sensor panel id. It is the same for the entire file because there is only one sensor panel for the plant.
- AMBIENT_TEMPERATURE - The ambient temperature at the plant.
- MODULE_TEMPERATURE - A module (solar panel) is attached to the sensor panel. This is the temperature reading for that module.
- IRRADIATION - Amount of irradiation for the 15-minute interval.

## Task 0: Import the relevant libraries

In [None]:
# Numpy, Pandas, Matplotlib, anything else?


## Task 1: Load the data

Load the power generation data and sensor data for Plant 1 into Pandas DataFrames for further analysis.

They can be found in the GitHub repo linked above. Here are the links for the two files -

Power Generation Data for Plant 1 - https://raw.githubusercontent.com/anikannal/Pandas_Data/master/Power%20Data/Plant_1_Generation_Data.csv

Weather Sensor Data for Plant 1 - https://raw.githubusercontent.com/anikannal/Pandas_Data/master/Power%20Data/Plant_1_Weather_Sensor_Data.csv

In [None]:
# Load the two files into Pandas DataFrames
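
One possible way to load the files is `pandas.read_csv`, which accepts a URL as well as a local path or a file-like object. The sketch below builds the URLs from the links above using a hypothetical helper, `plant_urls`, and then demonstrates `read_csv` on a tiny in-memory CSV so that it runs without network access. The sample rows are invented placeholders, not real readings.

```python
import io

import pandas as pd

# Base URL taken from the file links above.
BASE_URL = "https://raw.githubusercontent.com/anikannal/Pandas_Data/master/Power%20Data/"


def plant_urls(plant_number):
    """Hypothetical helper: build the generation and sensor URLs for one plant."""
    return (
        f"{BASE_URL}Plant_{plant_number}_Generation_Data.csv",
        f"{BASE_URL}Plant_{plant_number}_Weather_Sensor_Data.csv",
    )


gen_url, sensor_url = plant_urls(1)

# Offline demonstration: read_csv handles a file-like object the same way as a URL.
# These rows are invented placeholders.
sample_csv = io.StringIO(
    "DATE_TIME,PLANT_ID,SOURCE_KEY,DC_POWER,AC_POWER,DAILY_YIELD,TOTAL_YIELD\n"
    "15-05-2020 00:00,1,INV_A,0.0,0.0,0.0,1000.0\n"
    "15-05-2020 00:15,1,INV_B,0.0,0.0,0.0,2000.0\n"
)
sample_df = pd.read_csv(sample_csv)
```

In the notebook, `pd.read_csv(gen_url)` and `pd.read_csv(sensor_url)` would fetch the real Plant 1 files, assuming network access; `plant_urls(2)` gives the Plant 2 equivalents.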



Load the power generation data and sensor data for Plant 2 into Pandas DataFrames for further analysis.

They can be found in the GitHub repo linked above. Here are the links for the two files -

Power Generation Data for Plant 2 - https://raw.githubusercontent.com/anikannal/Pandas_Data/master/Power%20Data/Plant_2_Generation_Data.csv

Weather Sensor Data for Plant 2 - https://raw.githubusercontent.com/anikannal/Pandas_Data/master/Power%20Data/Plant_2_Weather_Sensor_Data.csv

## Task 2: Explore each dataset - columns, counts, basic stats

In [None]:
# .info, .head, .describe
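
As a rough sketch of what the three methods named in the comment return, here they are on a small stand-in DataFrame. The rows are invented placeholders; on the real data you would call the same methods on each of the four loaded DataFrames.

```python
import pandas as pd

# Invented placeholder rows standing in for the real generation data.
df = pd.DataFrame({
    "SOURCE_KEY": ["INV_A", "INV_B", "INV_A", "INV_B"],
    "DC_POWER": [0.0, 0.0, 120.5, 118.0],
    "AC_POWER": [0.0, 0.0, 117.9, 115.2],
})

df.info()                # prints column names, dtypes, and non-null counts
first_rows = df.head(2)  # first two rows, useful for eyeballing the values
stats = df.describe()    # count, mean, std, min, quartiles, max for numeric columns
n_rows, n_cols = df.shape
```

Note that `describe()` skips the non-numeric `SOURCE_KEY` column by default; `df["SOURCE_KEY"].value_counts()` is one way to summarize it instead.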



## Task 3: Understand the domain context and explore underlying patterns in the data

## Task 4: Pre-process the data to allow for some of the analysis (hint: date and time)
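
Following the hint, one reasonable first step is to convert `DATE_TIME` from text to a proper datetime type and derive separate date and time columns. The sketch below assumes day-first strings such as `15-05-2020 00:15`; that format is an assumption, so check the actual strings in each file (the files may not all use the same format) and adjust `format` accordingly. The rows are invented placeholders.

```python
import pandas as pd

# Invented rows; the day-first string format is an assumption to verify.
gen = pd.DataFrame({
    "DATE_TIME": ["15-05-2020 00:00", "15-05-2020 00:15", "16-05-2020 00:00"],
    "DC_POWER": [0.0, 0.0, 0.0],
})

# An explicit format avoids silent day/month swaps during parsing.
gen["DATE_TIME"] = pd.to_datetime(gen["DATE_TIME"], format="%d-%m-%Y %H:%M")

# Split out the calendar date and the time of day for later grouping and plotting.
gen["DATE"] = gen["DATE_TIME"].dt.date
gen["TIME"] = gen["DATE_TIME"].dt.time
```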

## Task 5: Integrate the relevant data for creating a richer dataset

Integrate Plant 1 power generation data and weather sensor data to create a new integrated DataFrame

Integrate Plant 2 power generation data and weather sensor data to create a new integrated DataFrame
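
A minimal sketch of one way to integrate the two datasets for a plant, assuming `DATE_TIME` has already been parsed to datetimes in both. Because `SOURCE_KEY` identifies an inverter in one file and the sensor panel in the other, this version drops the sensor's `SOURCE_KEY` and joins on `DATE_TIME` and `PLANT_ID`. The rows are invented placeholders, and a left join is only one of several defensible choices.

```python
import pandas as pd

# Invented placeholder rows: two inverters, two timestamps, one sensor panel.
timestamps = pd.to_datetime(["2020-05-15 12:00", "2020-05-15 12:15"])
gen = pd.DataFrame({
    "DATE_TIME": list(timestamps) * 2,
    "PLANT_ID": [1, 1, 1, 1],
    "SOURCE_KEY": ["INV_A", "INV_A", "INV_B", "INV_B"],
    "DC_POWER": [100.0, 110.0, 90.0, 95.0],
})
sensor = pd.DataFrame({
    "DATE_TIME": timestamps,
    "PLANT_ID": [1, 1],
    "SOURCE_KEY": ["SENSOR_1", "SENSOR_1"],
    "IRRADIATION": [0.8, 0.9],
})

# A left join keeps every inverter reading, even if a sensor reading is absent.
merged = gen.merge(
    sensor.drop(columns="SOURCE_KEY"),
    on=["DATE_TIME", "PLANT_ID"],
    how="left",
)
```

Each inverter row now carries the plant-level sensor reading for its timestamp. Checking `merged["IRRADIATION"].isna().sum()` afterwards shows how many generation rows found no matching sensor row.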

## Task 6: Explore the data and try to answer questions like -

- What is the mean value of daily yield?
- What is the total irradiation per day?
- What is the max ambient and module temperature?
- How many inverters are there for each plant?
- What is the maximum/minimum amount of DC/AC Power generated in a time interval/day?
- Which inverter (source_key) has produced maximum DC/AC power?
- Rank the inverters based on the DC/AC power they produce
- Is there any missing data?

#### What is the mean value of daily yield?
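
Because `DAILY_YIELD` is described as cumulative within a day, the plain mean of the column averages many partial totals. The sketch below contrasts that with one alternative reading of the question: take each inverter's end-of-day (maximum) value, then average those. Which reading is intended is an assumption you should state in your answer; the rows are invented placeholders.

```python
import pandas as pd

# Invented placeholder rows for a single inverter over two days.
gen = pd.DataFrame({
    "DATE_TIME": pd.to_datetime([
        "2020-05-15 12:00", "2020-05-15 18:00",
        "2020-05-16 12:00", "2020-05-16 18:00",
    ]),
    "SOURCE_KEY": ["INV_A"] * 4,
    "DAILY_YIELD": [100.0, 300.0, 150.0, 500.0],
})

# Reading 1: average of every cumulative reading.
naive_mean = gen["DAILY_YIELD"].mean()

# Reading 2: final (maximum) value per inverter per day, then the average of those.
end_of_day = gen.groupby(["SOURCE_KEY", gen["DATE_TIME"].dt.date])["DAILY_YIELD"].max()
mean_end_of_day = end_of_day.mean()
```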

#### What is the total irradiation per day?
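
One way to get a per-day total is to group the sensor readings by calendar date and sum. This is a sketch on invented placeholder rows, assuming `DATE_TIME` is already a datetime column.

```python
import pandas as pd

# Invented placeholder sensor readings across two days.
sensor = pd.DataFrame({
    "DATE_TIME": pd.to_datetime([
        "2020-05-15 12:00", "2020-05-15 12:15",
        "2020-05-16 12:00", "2020-05-16 12:15",
    ]),
    "IRRADIATION": [0.5, 0.25, 1.0, 0.5],
})

# Group by the date part of the timestamp and add up the interval readings.
daily_irradiation = sensor.groupby(sensor["DATE_TIME"].dt.date)["IRRADIATION"].sum()
```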

#### What is the max ambient and module temperature?

#### How many inverters are there for each plant?
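
Since `SOURCE_KEY` is the inverter id in the generation files, counting its distinct values per `PLANT_ID` is one way to answer this. The sketch below stacks two invented stand-in frames with `pd.concat` first; the ids are placeholders.

```python
import pandas as pd

# Invented placeholder rows standing in for the two generation files.
plant_1 = pd.DataFrame({"PLANT_ID": [1, 1, 1], "SOURCE_KEY": ["INV_A", "INV_B", "INV_A"]})
plant_2 = pd.DataFrame({"PLANT_ID": [2, 2], "SOURCE_KEY": ["INV_X", "INV_X"]})

# Stack both plants, then count distinct inverter ids within each plant.
both = pd.concat([plant_1, plant_2], ignore_index=True)
inverters_per_plant = both.groupby("PLANT_ID")["SOURCE_KEY"].nunique()
```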

#### What is the maximum/minimum amount of DC/AC Power generated in a time interval/day?

#### Which inverter (source_key) has produced maximum DC/AC power?

#### Rank the inverters based on the DC/AC power they produce
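
A possible approach, sketched on invented placeholder rows: aggregate `DC_POWER` per inverter, sort, and rank. Summing the interval readings is an assumption about what "produce" means here; the mean reading or the final `TOTAL_YIELD` would be alternative bases for the ranking.

```python
import pandas as pd

# Invented placeholder readings for three inverters.
gen = pd.DataFrame({
    "SOURCE_KEY": ["INV_A", "INV_B", "INV_C", "INV_A", "INV_B", "INV_C"],
    "DC_POWER": [100.0, 80.0, 120.0, 110.0, 70.0, 130.0],
})

# Total DC power per inverter, largest first.
total_dc = gen.groupby("SOURCE_KEY")["DC_POWER"].sum().sort_values(ascending=False)

# Rank 1 is the highest producer; method="min" gives tied inverters the same rank.
ranks = total_dc.rank(ascending=False, method="min").astype(int)

# The same aggregate also answers which inverter produced the most.
top_inverter = total_dc.idxmax()
```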

#### Is there any missing data?
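
Missing data can show up in two ways: empty values inside rows that exist, and rows that are absent altogether. The sketch below checks both on invented placeholder rows, assuming readings are expected on a regular 15-minute grid as the column descriptions state.

```python
import numpy as np
import pandas as pd

# Invented placeholder rows: one NaN value and one skipped timestamp (00:30).
gen = pd.DataFrame({
    "DATE_TIME": pd.to_datetime([
        "2020-05-15 00:00", "2020-05-15 00:15", "2020-05-15 00:45",
    ]),
    "DC_POWER": [0.0, np.nan, 0.0],
})

# Missing values inside the rows that exist.
nulls_per_column = gen.isna().sum()

# Missing rows: timestamps expected on the 15-minute grid but absent from the data.
expected = pd.date_range(gen["DATE_TIME"].min(), gen["DATE_TIME"].max(), freq="15min")
missing_timestamps = expected.difference(pd.DatetimeIndex(gen["DATE_TIME"]))
```

For the generation data, the second check is more informative when run per inverter, for example inside a `groupby("SOURCE_KEY")`.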

## Task 7: Visualization and further exploration
Employ different visualization techniques to understand the data and underlying patterns

Start with graphs that explain the patterns for attributes independent of other variables. These will usually be tracked as changes of attributes against DATE_TIME, DATE, or TIME. 

Examples - 
- how is DC or AC Power changing as time goes by? 
- how is irradiation changing as time goes by? 
- how are ambient and module temperature changing as time goes by? 
- how does yield change as time goes by? 

Explore plotting variables against different granularities of DATE_TIME and decide which is the best option for each variable.
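
To compare granularities, one option is `resample`, which re-bins a datetime-indexed series into coarser intervals. The sketch below uses invented placeholder readings; choosing the mean for hourly bins and the sum for daily bins is an illustrative assumption, not a recommendation for every variable.

```python
import pandas as pd

# Invented placeholder readings: eight 15-minute observations on one day.
sensor = pd.DataFrame({
    "DATE_TIME": pd.date_range("2020-05-15 10:00", periods=8, freq="15min"),
    "IRRADIATION": [0.25, 0.25, 0.5, 0.5, 0.75, 0.75, 1.0, 1.0],
}).set_index("DATE_TIME")

# 60-minute bins: average the four readings that fall in each hour.
hourly_mean = sensor["IRRADIATION"].resample("60min").mean()

# Daily bins: total of all readings on each calendar day.
daily_sum = sensor["IRRADIATION"].resample("D").sum()
```

Plotting `hourly_mean` and `daily_sum` side by side with the raw series shows how much detail each granularity keeps or hides.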

#### How is DC or AC Power changing as time goes by?
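
A minimal plotting sketch, assuming Matplotlib is available: sum `DC_POWER` across inverters for each timestamp to get a plant-level series, then draw it as a line. The rows are invented placeholders, and summing across inverters is only one choice; plotting one line per inverter is another.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented placeholder readings for two inverters over four intervals.
gen = pd.DataFrame({
    "DATE_TIME": list(pd.date_range("2020-05-15 06:00", periods=4, freq="15min")) * 2,
    "SOURCE_KEY": ["INV_A"] * 4 + ["INV_B"] * 4,
    "DC_POWER": [0.0, 50.0, 120.0, 200.0, 0.0, 45.0, 110.0, 190.0],
})

# Plant-level series: sum over inverters at each timestamp.
plant_dc = gen.groupby("DATE_TIME")["DC_POWER"].sum()

fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(plant_dc.index, plant_dc.values)
ax.set_xlabel("DATE_TIME")
ax.set_ylabel("DC_POWER (kW, summed over inverters)")
ax.set_title("Plant-level DC power over time (toy data)")
fig.autofmt_xdate()  # tilt the date labels so they do not overlap
```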

#### How is irradiation changing as time goes by?

#### How are ambient and module temperature changing as time goes by?

#### How does yield change as time goes by?

Plot two variables against each other to discover the degree of correlation between them. Try out different variable pairs -
- Ambient and module temperature
- DC and AC Power
- Irradiation and module/ambient temperature
- Irradiation and DC/AC Power

Can you find different ways of visualizing the above relationships?

#### Ambient and module temperature

#### DC and AC Power

#### Irradiation and module/ambient temperature

#### Irradiation and DC/AC Power

## Task 8: Discovering correlation

Discover which variables are correlated. Which variables are independent and which ones are dependent?

#### Calculate the correlation coefficients for the variables that are showing substantial visual correlation
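
One way to compute the coefficients is `DataFrame.corr`, which returns pairwise Pearson correlations by default. The sketch below uses invented placeholder values in which `DC_POWER` is an exact multiple of `IRRADIATION`, so that pair correlates perfectly by construction; real data will not be this clean.

```python
import pandas as pd

# Invented placeholder values; DC_POWER is exactly 500 * IRRADIATION here.
df = pd.DataFrame({
    "IRRADIATION": [0.0, 0.2, 0.4, 0.6, 0.8],
    "DC_POWER": [0.0, 100.0, 200.0, 300.0, 400.0],
    "AMBIENT_TEMPERATURE": [24.0, 27.0, 25.0, 29.0, 26.0],
})

# Full pairwise matrix (Pearson by default).
corr_matrix = df.corr()

# A single pair can also be computed directly from two columns.
irr_dc = df["IRRADIATION"].corr(df["DC_POWER"])
```

On the integrated DataFrame, select only the numeric columns of interest before calling `corr`, since identifier columns such as `PLANT_ID` add nothing to the matrix.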

#### Visualize the correlation between relevant variables using scatter plots

## Task 9: Tell a story

This is probably the most important skill for a Data Scientist - the ability to communicate the story. You have worked hard over the last two tasks to understand the data and discover patterns and relationships. It is time to put all that to good use.

Find one story worth telling based on all your work. Create a notebook that walks the viewer through the entire story, one step at a time.

A few tips -

- Pick an interesting conclusion that you want to arrive at
- Build a logical progression from loading and pre-processing data to showing minor observations along the way and eventually building up to the grand finale
- Substantiate your argument with data along the way (you are a data scientist, not just a storyteller :))
- Every good story has some key elements: characters, setting, plot, complication, and solution. Try to build in as many of them as you can.
- Deliver the aha moment! Give the viewer an insight that can potentially impact the business. If the viewer doesn't get that, they won't appreciate your effort.