# 1.2 Stages of data analysis

1. Investigate information requirements
2. Data collection
3. Data organisation 
4. Data storage 
5. Data cleansing
6. Data manipulation
7. Presentation of findings 

## Remember: press "Shift + Enter" to run a cell

## Setup

In [None]:
import micropip
await micropip.install(["pyoliteutils", "xlrd"])

In [None]:
from pyoliteutils import *
import pyoliteutils
import pandas as pd
import matplotlib.pyplot as plt

pyoliteutils.__version__

## 1.2 Stages of data analysis

In [None]:
mm("""
mindmap
  root{{Stages of data analysis}}
    1(Investigate information requirements)
        What do we need to know to make a decision?
            e.g. market share, particulates in the air, testing of new drugs
    2(Data collection)
        How can we gather that data?
            e.g. observations, interviews, review of existing data
    3(Data organisation) 
        How to organise the data so we can work with it
            e.g. digitalisation, transcription, sorting, data mining
    4(Data storage)
        How and where to store the data
            e.g. in-house, external
    5(Data cleansing) 
        Tidying up the data
            e.g. errors, missing elements, duplicates
    6(Data manipulation) 
        Working with the information to find the knowledge
            e.g. arranging, collating, aggregating, interpreting, correlation
    7(Presentation of findings) 
        Seeing the knowledge so you can make wise decisions
            e.g. tables, charts, graphs, dashboard, reports
""")

### Note : New Technique
The `mm(diagramtext)` function displays diagrams written in the Mermaid diagramming language. You can try out diagram text at <https://mermaid.live/>.
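
Mermaid mindmaps use indentation to show which idea belongs to which parent, so the diagram text can also be generated from Python data. The sketch below is one possible way to do that. `build_mindmap` is a hypothetical helper written for this example (it is not part of pyoliteutils), and it assumes the labels are plain text.

```python
def build_mindmap(root, tree, indent=2):
    """Return Mermaid mindmap text for a root label and a nested dict of ideas."""
    lines = ["mindmap", " " * indent + "root{{" + root + "}}"]

    def add_children(children, depth):
        # Each level of nesting is indented further than its parent.
        for label, grandchildren in children.items():
            lines.append(" " * (indent * depth) + label)
            add_children(grandchildren, depth + 1)

    add_children(tree, 2)
    return "\n".join(lines)


# Hypothetical ideas: an empty dict means "no child ideas".
ideas = {"Idea 1": {}, "Idea 2": {"Related Thought": {}}}
diagram_text = build_mindmap("What do we need to know", ideas)
print(diagram_text)
```

In the notebook, the resulting text could then be passed to `mm(diagram_text)`.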


## 1 : Investigate Information Requirements 

### What do we need to know to make a decision?
#### What is the question / decision?

e.g.
- Is a new drug safe?
- Should I invest in shares in this company now?

### Question : "Should we try to reduce CO2 in the atmosphere to reduce climate change?"

#### What do we need to know?

Double-click here to enter your ideas as text, or edit the mindmap in the next cell:

In [None]:
mm("""
mindmap
  root{{What do we need to know}}
    Idea 1
    Idea 2
        Related Thought
""")

## 2 : Data Collection 

### How can we gather that data?

e.g. 
* observations
* interviews
* review of existing data
* making sure IT systems record relevant data

### Task : Gather as many Data Collection methods as you can and group them by Data Types from last week.

Record them in a table, mindmap or list:


|            | Qualitative| Quantitative|
|------------|------------|-------------|
|Structured  |            |             |
|Unstructured|            |             |

In [None]:
mm("""
mindmap
  root{{Data collection methods}}
    Structured Qualitative
        Idea 1
    Unstructured Qualitative
    Structured Quantitative
    Unstructured Quantitative
""")

### Data Collection for "Should we try to reduce CO2 in the atmosphere to reduce climate change?"

- Existing data:
    - Observations of temperature over time over the globe
    - Observations of temperature over older times derived from ice cores <https://data.ess-dive.lbl.gov/catalog/d1/mn/v2/object/ess-dive-1e57f3f83864c10-20180717T104354142744>
    - Observations of CO2 in the atmosphere over time (recent and ice core based)
- Collecting new data
    - Thermometers around the world

## 3: Data Organisation 

### How to organise the data so we can work with it

e.g. 
* digitalisation
* transcription
* sorting
* data mining
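
As a small illustration of sorting, the sketch below puts records into year order with pandas. The readings are made up for the example and do not come from any of the datasets used in this notebook.

```python
import pandas as pd

# Hypothetical readings, typed in the order they were collected.
readings = pd.DataFrame(
    {
        "Year": [2003, 2001, 2002],
        "Reading": [0.62, 0.54, 0.63],
    }
)

# Sorting by year puts the records into an order we can work with.
organised = readings.sort_values("Year").reset_index(drop=True)
print(organised)
```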

## 4: Data Storage

### How and where to store the data

e.g. 
* in-house
* external
* Online in open datasets

#### Loading from online Datasets

In [None]:
co2file = await get_file_from_url("https://raw.githubusercontent.com/UTCSheffield/OCR-Unit-7-Data-analysis-and-design/main/content/data/annual-co2-emissions-per-country.csv")
co2df = pd.read_csv(co2file)
co2df

In [None]:
# named temperature_file so it does not hide Python's built-in tempfile module
temperature_file = await get_file_from_url("https://raw.githubusercontent.com/UTCSheffield/OCR-Unit-7-Data-analysis-and-design/main/content/data/temperature-anomaly.csv")
tempdf = pd.read_csv(temperature_file)
tempdf
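
For comparison with loading from an online dataset, here is a minimal sketch of "in-house" storage: saving a table to a CSV file on our own machine and reading it back. The data and the file name `example_visitors.csv` are made up for the example.

```python
from pathlib import Path

import pandas as pd

# Hypothetical data standing in for a dataset we have collected.
data = pd.DataFrame({"Year": [2001, 2002], "Visitors": [120, 135]})

# Store the table as a CSV file in the current folder (hypothetical file name).
store = Path("example_visitors.csv")
data.to_csv(store, index=False)

# Reading the file back should give the rows we saved.
restored = pd.read_csv(store)
print(restored)
```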

## 5: Data Cleansing

e.g.
* errors
* missing elements
* duplicates
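
The three kinds of problem listed above can each be handled with pandas. This is only a sketch on made-up data: the rule that a year above 2100 must be an error is an assumption chosen for the example, and real datasets need their own checks.

```python
import pandas as pd

# Hypothetical raw data with a duplicate row, a missing reading and an impossible year.
raw = pd.DataFrame(
    {
        "Year": [2001, 2001, 2002, 2003, 20004],
        "Reading": [0.5, 0.5, None, 0.7, 0.8],
    }
)

cleaned = (
    raw.drop_duplicates()            # duplicates: keep one copy of each repeated row
    .dropna(subset=["Reading"])      # missing elements: drop rows with no reading
    .query("Year <= 2100")           # errors: drop years that cannot be right
    .reset_index(drop=True)
)
print(cleaned)
```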

## 6: Data Manipulation

e.g. 
* arranging
* collating
* aggregating
* interpreting
* correlation
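
As an example of aggregating, the sketch below groups yearly values into decades and averages each group. The numbers are made up for the example.

```python
import pandas as pd

# Hypothetical yearly values.
yearly = pd.DataFrame(
    {"Year": [1990, 1995, 2000, 2005], "Value": [10, 20, 30, 50]}
)

# Work out which decade each year belongs to, e.g. 1995 -> 1990.
yearly["Decade"] = (yearly["Year"] // 10) * 10

# Aggregating: one average value per decade.
by_decade = yearly.groupby("Decade")["Value"].mean()
print(by_decade)
```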

### Arranging

In [None]:
# keep only the world totals; "Year" stays an ordinary column because the plots below use worldco2["Year"]
worldco2 = co2df.query("Entity == 'World'")[["Year", "Annual CO₂ emissions"]]
worldco2

In [None]:
worldtemps = tempdf.query("Entity == 'Global'")[["Year", "Global average temperature anomaly relative to 1961-1990"]]
worldtemps
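
Once two tables are arranged like this, they can be collated on their shared "Year" column and checked for correlation. The sketch below uses small made-up tables as stand-ins for `worldco2` and `worldtemps`, with shortened hypothetical column names, so the result says nothing about the real data.

```python
import pandas as pd

# Hypothetical stand-ins for the two arranged tables.
co2 = pd.DataFrame({"Year": [2000, 2001, 2002, 2003], "CO2": [1.0, 2.0, 3.0, 4.0]})
temps = pd.DataFrame({"Year": [2001, 2002, 2003, 2004], "Anomaly": [0.2, 0.4, 0.6, 0.8]})

# Collating: join the tables, keeping only years that appear in both.
combined = co2.merge(temps, on="Year")

# Correlation: a value near 1 means the two columns rise together,
# near -1 means one falls as the other rises, near 0 means no clear link.
r = combined["CO2"].corr(combined["Anomaly"])
print(combined)
print(r)
```

A strong correlation shows that two measurements move together. On its own it does not show that one causes the other.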

## 7: Presentation of Findings

e.g. 
* tables
* charts
* graphs
* dashboard
* reports

### Graphs

In [None]:
from bokeh.plotting import figure, show

# create a new plot with a title and axis labels
p1 = figure(title="Annual CO₂ emissions", y_range=(0, 3.5e10), x_axis_label="Year", y_axis_label="Annual CO₂ emissions", height=400, width=1000)

# add a line renderer with legend and line thickness
p1.line(worldco2["Year"], worldco2["Annual CO₂ emissions"], legend_label="CO₂", line_width=2)

# show the results
show(p1)

In [None]:
# create a new plot with a title and axis labels
p2 = figure(title="Global average temperature anomaly", y_range=(-1, 1), x_axis_label="Year", y_axis_label="Global average temperature anomaly relative to 1961-1990", height=400, width=1000)
# add a line renderer with legend and line thickness
p2.line(worldtemps["Year"], worldtemps["Global average temperature anomaly relative to 1961-1990"], legend_label="Temp.", line_width=2)

# show the results
show(p2)

### Getting better Knowledge out visually

In [None]:
from bokeh.layouts import column
# stack the two plots vertically so they can be compared
show(column(p1, p2))

## What can you tell from the graphs?

Answer here?

### Should we try to reduce CO2 in the atmosphere to reduce climate change?


Answer here?

### Work in Progress

In [None]:
from bokeh.io import output_notebook
from bokeh.models import LinearAxis, Range1d
from bokeh.palettes import Bokeh6
from bokeh.plotting import figure, show

output_notebook()

blue, red = Bokeh6[5], Bokeh6[0]
temp_column = "Global average temperature anomaly relative to 1961-1990"

# left-hand y axis: temperature anomaly
p = figure(title="Temperature anomaly and annual CO₂ emissions", y_range=(-1, 1), x_axis_label="Year", height=400, width=1000)
p.background_fill_color = "#fafafa"
p.line(worldtemps["Year"], worldtemps[temp_column], legend_label="Temp.", line_width=2, color=blue)
p.yaxis.axis_label = "Temperature anomaly"
p.yaxis.axis_label_text_color = blue

# second y range, so CO₂ emissions can share the plot despite the much larger numbers
p.extra_y_ranges["co2"] = Range1d(0, 3.5e10)
p.line(worldco2["Year"], worldco2["Annual CO₂ emissions"], legend_label="CO₂", line_width=2, color=red, y_range_name="co2")

# right-hand y axis for the CO₂ range
co2_axis = LinearAxis(axis_label="Annual CO₂ emissions", y_range_name="co2")
co2_axis.axis_label_text_color = red
p.add_layout(co2_axis, "right")

show(p)
