### Notebook: Data Exploration Steps
---

Data exploration comes first because it shows how each dataset is structured, what it contains, and what kinds of insights it might support. The guided steps below can be included in your Jupyter Notebook to explore the data.

1. **Initial Data Loading and Inspection**
- Load each dataset into separate data frames. Use libraries such as `pandas` for this purpose.

In [9]:
import pandas as pd
import geopandas as gpd
accidents_data = pd.read_csv('data/accidents_Berlin_2021_cleaned.csv')
bikelanes_data = gpd.read_file('data/bike_lanes_Berlin.geojson')
air_quality_data = pd.read_csv('data/air_quality_Berlin_2021.csv')

- Perform a basic inspection of the data using methods like `.head()`, `.info()`, and `.describe()` to understand the structure, types of data available, and statistical summaries.

In [10]:
accidents_data.head()

Unnamed: 0,ObjectID,State,District,LOR_ab_2021,AccidentYear,AccidentMonth,AccidentHour,DayOfWeek,AccidentCategory,AccidentType,...,InvolvingCar,InvolvingPedestrian,InvolvingMotorcycle,InvolvingHGV,InvolvingOther,RoadCondition,GraphicCoord1,GraphicCoord2,LongitudeWGS84,LatitudeWGS84
0,219249,11,3,3701658.0,2021,11,18,2,3,0,...,1,0,1,0,0,1,8002020742,5829640204,13.426895,52.53394
1,219248,11,7,7501134.0,2021,12,19,7,3,6,...,1,1,0,0,0,1,7984795317,5819049219,13.39209,52.439951
2,219247,11,4,4100101.0,2021,12,17,4,3,5,...,1,0,0,0,0,0,7933526128,5829680195,13.326242,52.538028
3,219246,11,4,4501041.0,2021,12,15,7,3,5,...,1,0,1,0,0,1,7929500395,5825362081,13.316521,52.499534
4,219243,11,11,11501339.0,2021,12,9,5,3,3,...,1,0,0,0,1,2,80718201,5825602793,13.525752,52.493867


In [11]:
bikelanes_data.head()

Unnamed: 0,gml_id,subject_code,segment,segment_district,station_street,storage_name,location,lane_type,length,use_mandatory,geometry
0,b_radverkehrsanlagen.1,10-000415,58530025_58530021.01,B1/B5,Alt-Kaulsdorf,Marzahn-Hellersdorf,Kaulsdorf,Bike path,26.0,yes,"MULTILINESTRING ((13.58199 52.50505, 13.58237 ..."
1,b_radverkehrsanlagen.2,10-000039,60530006_60530009.01,B1/B5,Alt-Mahlsdorf,Marzahn-Hellersdorf,Mahlsdorf,Bike path,16.0,yes,"MULTILINESTRING ((13.60630 52.50471, 13.60639 ..."
2,b_radverkehrsanlagen.3,10-000038,60530005_60530006.01,B1/B5,Alt-Mahlsdorf,Marzahn-Hellersdorf,Mahlsdorf,Bike path,510.0,no,"MULTILINESTRING ((13.59882 52.50483, 13.59905 ..."
3,b_radverkehrsanlagen.4,10-000011,57530025_57530026.01,B1/B5,Alt-Biesdorf,Marzahn-Hellersdorf,Biesdorf,Bike path,76.0,yes,"MULTILINESTRING ((13.56179 52.50846, 13.56214 ..."
4,b_radverkehrsanlagen.5,10-000012,57530001_57530017.01,B1/B5,Alt-Biesdorf,Marzahn-Hellersdorf,Biesdorf,Bike path,192.0,yes,"MULTILINESTRING ((13.56277 52.50814, 13.56333 ..."


In [12]:
air_quality_data.head()

Unnamed: 0,name,code,codeEu,address,lat,lng,date,value,pollutant
0,010 Wedding,mc010,DEBE010,"13353 Wedding, Amrumer Str./Limburger Str.",52.54291,13.34926,2021-12-31,13.0,pm10
1,010 Wedding,mc010,DEBE010,"13353 Wedding, Amrumer Str./Limburger Str.",52.54291,13.34926,2021-12-30,13.0,pm10
2,010 Wedding,mc010,DEBE010,"13353 Wedding, Amrumer Str./Limburger Str.",52.54291,13.34926,2021-12-30,13.0,pm10
3,010 Wedding,mc010,DEBE010,"13353 Wedding, Amrumer Str./Limburger Str.",52.54291,13.34926,2021-12-30,14.0,pm10
4,010 Wedding,mc010,DEBE010,"13353 Wedding, Amrumer Str./Limburger Str.",52.54291,13.34926,2021-12-30,14.0,pm10
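
The cells above only call `.head()`. As a minimal sketch of the other two inspection methods, the block below rebuilds the five rows shown above in a small frame (named `air_sample` here, a name chosen for illustration) and calls `.info()` and `.describe()` on it. On the full `air_quality_data` frame the calls are the same.

```python
import pandas as pd

# The first five rows of the air quality output above, reduced to four columns
air_sample = pd.DataFrame({
    "name": ["010 Wedding"] * 5,
    "date": ["2021-12-31", "2021-12-30", "2021-12-30", "2021-12-30", "2021-12-30"],
    "value": [13.0, 13.0, 13.0, 14.0, 14.0],
    "pollutant": ["pm10"] * 5,
})

# Column names, non-null counts, and dtypes
air_sample.info()

# Count, mean, standard deviation, and quartiles of the numeric columns
summary = air_sample.describe()
print(summary)

# Frequency of each category in a non-numeric column
print(air_sample["pollutant"].value_counts())
```

Note that `.describe()` skips non-numeric columns by default, so `value_counts()` is a useful complement for columns such as `pollutant`.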


2. **Data Cleaning**
- Identify and handle missing values. Decide whether to fill them with data (e.g., mean, median) or remove the rows/columns entirely.

In [13]:
# for example, you might want to check for missing values in the air pollution dataset
# show rows with missing values
air_quality_data[air_quality_data['value'].isnull()]

Unnamed: 0,name,code,codeEu,address,lat,lng,date,value,pollutant
1490,018 Schöneberg,mc018,DEBE018,"10823 Berlin, Belziger Str. 52",52.485790,13.348850,,,
1491,027 Marienfelde,mc027,DEBE027,"12307 Berlin, Schichauweg 60",52.398400,13.368075,,,
6532,085 Friedrichshagen,mc085,DEBE056,"12587 Berlin, Müggelseedamm 307-310",52.447697,13.647050,2021-12-07,,pm10
6533,085 Friedrichshagen,mc085,DEBE056,"12587 Berlin, Müggelseedamm 307-310",52.447697,13.647050,2021-12-07,,pm10
6534,085 Friedrichshagen,mc085,DEBE056,"12587 Berlin, Müggelseedamm 307-310",52.447697,13.647050,2021-12-07,,pm10
...,...,...,...,...,...,...,...,...,...
7372,085 Friedrichshagen,mc085,DEBE056,"12587 Berlin, Müggelseedamm 307-310",52.447697,13.647050,2021-12-03,,pm25
7452,115 Hardenbergplatz,mc115,DEBE067,"10623 Berlin, Hardenbergplatz",52.506630,13.332976,,,
10433,145 Frohnau,mc145,DEBE062,"13465 Berlin, Jägerstieg 1",52.653270,13.296080,,,
13414,282 Karlshorst,mc282,DEBE066,"10318 Berlin, Johanna-und-Willy-Brauer-Platz",52.485300,13.529500,,,
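
Once the missing rows are visible, one possible way to handle them is sketched below. It uses a small hand-made frame with the same column names as `air_quality_data`; the values are illustrative, not real measurements. Rows without a `date` and `pollutant` carry no measurement at all and are dropped, and the remaining gaps in `value` are filled with the median of the same station and pollutant. Whether to fill or drop is a judgment call for your project.

```python
import numpy as np
import pandas as pd

# Illustrative sample with the same column names as air_quality_data (made-up values)
sample = pd.DataFrame({
    "code": ["mc010", "mc010", "mc010", "mc085", "mc085", "mc018"],
    "date": ["2021-12-30", "2021-12-31", "2021-12-29", "2021-12-07", "2021-12-06", None],
    "value": [13.0, np.nan, 15.0, np.nan, 20.0, np.nan],
    "pollutant": ["pm10", "pm10", "pm10", "pm10", "pm10", None],
})

# Count missing values per column
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Rows with neither a date nor a pollutant are empty placeholders: drop them
cleaned = sample.dropna(subset=["date", "pollutant"]).copy()

# Fill the remaining gaps with the median of the same station and pollutant
cleaned["value"] = cleaned["value"].fillna(
    cleaned.groupby(["code", "pollutant"])["value"].transform("median")
)
print(cleaned)
```

A group median is only one option. If a station has long runs of missing days, dropping those rows may distort the analysis less than filling them.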


- Check for and correct any inconsistencies in data types (e.g., dates stored as strings).

In [14]:
# for example, check the data types of the columns in the accidents dataset
accidents_data.dtypes

ObjectID                 int64
State                    int64
District                 int64
LOR_ab_2021            float64
AccidentYear             int64
AccidentMonth            int64
AccidentHour             int64
DayOfWeek                int64
AccidentCategory         int64
AccidentType             int64
AccidentTypeDetail       int64
LightingCondition        int64
InvolvingBike            int64
InvolvingCar             int64
InvolvingPedestrian      int64
InvolvingMotorcycle      int64
InvolvingHGV             int64
InvolvingOther           int64
RoadCondition            int64
GraphicCoord1           object
GraphicCoord2           object
LongitudeWGS84         float64
LatitudeWGS84          float64
dtype: object
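
The output above shows `GraphicCoord1` and `GraphicCoord2` stored as `object`, and a `date` column read from CSV also arrives as a string unless it is parsed. A minimal sketch of both conversions follows, using small made-up frames. It assumes the coordinate strings use a comma as the decimal separator, which is a guess about the raw file, so check a few raw values before applying it.

```python
import pandas as pd

# Made-up rows that mimic the relevant columns
accidents_sample = pd.DataFrame({
    "GraphicCoord1": ["800202,0742", "798479,5317", "n/a"],
})
air_sample = pd.DataFrame({
    "date": ["2021-12-31", "2021-12-30", None],
    "value": [13.0, 14.0, None],
})

# Assumption: comma is the decimal separator; unparseable entries become NaN
accidents_sample["GraphicCoord1"] = pd.to_numeric(
    accidents_sample["GraphicCoord1"].str.replace(",", ".", regex=False),
    errors="coerce",
)

# Parse date strings; missing or invalid entries become NaT
air_sample["date"] = pd.to_datetime(air_sample["date"], errors="coerce")

print(accidents_sample.dtypes)
print(air_sample.dtypes)
```

Using `errors="coerce"` keeps the conversion from failing on a single bad entry, at the cost of silently producing missing values, so it is worth counting the NaN and NaT entries afterwards.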

3. **Feature Engineering**
- Create new columns that could be useful for analysis. For example, the accidents dataset already contains `AccidentHour` and `DayOfWeek`, so you might derive a time-of-day category or a weekend flag from them.
- Consider combining data from different sources based on common attributes (e.g., geolocation features).
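
As a sketch of what derived columns could look like, the block below adds a time-of-day category and a weekend flag to a small sample that reuses the `AccidentHour` and `DayOfWeek` values from the head output. The bucket edges and labels are arbitrary choices, and the weekday encoding (1 = Sunday through 7 = Saturday) is an assumption that should be checked against the dataset documentation.

```python
import pandas as pd

# AccidentHour and DayOfWeek from the first five rows, plus one made-up night-time row
accidents_sample = pd.DataFrame({
    "AccidentHour": [18, 19, 17, 15, 9, 2],
    "DayOfWeek": [2, 7, 4, 7, 5, 1],
})

# Hypothetical buckets: [0, 6) night, [6, 10) morning, [10, 16) midday,
# [16, 20) evening, [20, 24) late
accidents_sample["TimeOfDay"] = pd.cut(
    accidents_sample["AccidentHour"],
    bins=[0, 6, 10, 16, 20, 24],
    labels=["night", "morning", "midday", "evening", "late"],
    right=False,
)

# Assumption: 1 = Sunday and 7 = Saturday
accidents_sample["IsWeekend"] = accidents_sample["DayOfWeek"].isin([1, 7])

print(accidents_sample)
```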

4. **Exploratory Data Analysis (EDA)**
- Utilize visualizations (histograms, scatter plots, box plots) to understand distributions, relationships, and potential outliers in the data.
- Map geolocation data to visualize accidents, pollution levels, and cycling routes. Tools like `matplotlib`, `seaborn`, or `geopandas` can be helpful here.
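
A minimal plotting sketch, assuming `matplotlib` is installed: it draws accident counts per hour next to a plain longitude/latitude scatter, using only the five accident rows shown in the head output. The frame and variable names are chosen for illustration. With the full dataset the same calls apply, and `bikelanes_data.plot(ax=ax_map)` could be used to draw the bike lanes on the same axes.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# The five accident rows from the head output, reduced to three columns
accidents_sample = pd.DataFrame({
    "AccidentHour": [18, 19, 17, 15, 9],
    "LongitudeWGS84": [13.426895, 13.39209, 13.326242, 13.316521, 13.525752],
    "LatitudeWGS84": [52.53394, 52.439951, 52.538028, 52.499534, 52.493867],
})

# Number of accidents per hour of the day, ordered by hour
hour_counts = accidents_sample["AccidentHour"].value_counts().sort_index()

fig, (ax_hist, ax_map) = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: accidents per hour
ax_hist.bar(hour_counts.index, hour_counts.values)
ax_hist.set_xlabel("Hour of day")
ax_hist.set_ylabel("Number of accidents")

# Right panel: accident locations as a simple scatter
ax_map.scatter(accidents_sample["LongitudeWGS84"], accidents_sample["LatitudeWGS84"], s=10)
ax_map.set_xlabel("Longitude")
ax_map.set_ylabel("Latitude")

fig.tight_layout()
```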

5. **Preliminary Analysis**
- Start with simple analyses to gain insights from the data. For accidents data, you could calculate the frequency of accidents by location or time. For air pollution, assess average pollution levels across different areas or times of the year.
- Relate the datasets to each other, for example by checking how close accident hotspots are to the monitoring stations with the highest pollution readings.
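
The analyses above can be sketched with a few `groupby` calls and a distance calculation. The block below uses the five accident rows and three station coordinates that appear in the outputs earlier in this notebook. The pollution values for `mc018` and `mc085` are made up, and `haversine_km` is a helper defined here for illustration, not part of any library.

```python
import numpy as np
import pandas as pd

accidents_sample = pd.DataFrame({
    "District": [3, 7, 4, 4, 11],
    "AccidentHour": [18, 19, 17, 15, 9],
    "LongitudeWGS84": [13.426895, 13.39209, 13.326242, 13.316521, 13.525752],
    "LatitudeWGS84": [52.53394, 52.439951, 52.538028, 52.499534, 52.493867],
})
stations = pd.DataFrame({
    "code": ["mc010", "mc018", "mc085"],
    "lat": [52.54291, 52.48579, 52.447697],
    "lng": [13.34926, 13.34885, 13.64705],
})
measurements = pd.DataFrame({
    "code": ["mc010", "mc010", "mc018", "mc085"],
    "pollutant": ["pm10", "pm10", "pm10", "pm10"],
    "value": [13.0, 14.0, 20.0, 9.0],  # mc018 and mc085 values are made up
})

# Accident frequency by district, and mean pollution per station and pollutant
accidents_per_district = accidents_sample.groupby("District").size()
mean_pollution = measurements.groupby(["code", "pollutant"])["value"].mean()


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * np.arcsin(np.sqrt(a))


# Distance matrix (accidents x stations) via broadcasting, then nearest station per accident
dist = haversine_km(
    accidents_sample["LatitudeWGS84"].to_numpy()[:, None],
    accidents_sample["LongitudeWGS84"].to_numpy()[:, None],
    stations["lat"].to_numpy()[None, :],
    stations["lng"].to_numpy()[None, :],
)
accidents_sample["NearestStation"] = stations["code"].to_numpy()[dist.argmin(axis=1)]

print(accidents_per_district)
print(mean_pollution)
print(accidents_sample[["District", "NearestStation"]])
```

Assigning each accident to its nearest station is a coarse proxy for local air quality, since a handful of stations cover the whole city. It is a starting point for comparison rather than evidence of a relationship.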

6. **Idea Generation and Hypothesis Formation**
- Based on your initial findings, brainstorm potential use cases or areas of interest for further exploration. This could range from identifying the safest and healthiest cycling routes to predicting high-risk areas for future accidents or pollution spikes.
- Formulate hypotheses or questions that you want your analysis to address.

In [7]:
# now it's your turn to explore the data and come up with interesting insights and ideas for your project!
# Happy Hacking!