# Programming for Data Analysis
by Clare Tubridy
***

## Problem Statement
> For this project you must create a data set by simulating a real-world phenomenon of your choosing. You may pick any phenomenon you wish – you might pick one that is of interest to you in your personal or professional life. Then, rather than collect data related to the phenomenon, you should model and synthesise such data using Python. We suggest you use the numpy.random package for this purpose. Specifically, in this project you should:
>
>- Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables.
>-  Investigate the types of variables involved, their likely distributions, and their relationships with each other.
>-  Synthesise/simulate a data set as closely matching their properties as possible.
>-  Detail your research and implement the simulation in a Jupyter notebook – the data set itself can simply be displayed in an output cell within the notebook.
>
> Note that this project is about simulation – you must synthesise a data set. Some students may already have some real-world data sets in their own files. It is okay to base your synthesised data set on these should you wish (please reference it if you do), but the main task in this project is to create a synthesised data set. The next section gives an example project idea.

***

## Airline Baggage Complaints Simulation
### Background
Individuals who frequently fly are well aware that occasional challenges are bound to occur. Flights may experience delays or cancellations due to factors like weather conditions, mechanical issues, or labor strikes. Moreover, baggage-related issues such as loss, delay, damage, or theft are not uncommon. The fact that numerous airlines now charge for luggage makes problems with baggage especially frustrating. Such issues can significantly affect customer loyalty and also pose financial burdens for airlines, as they often incur costs associated with delivering misplaced bags.

**Phenomenon:** <br>
The phenomenon of interest is the relationship between the number of baggage complaints an airline receives and its flight activity: the flights it schedules, the flights it cancels, and the passengers it enplanes.

**Variables:**<br>
1. ***Number of Baggage Complaints:*** The total number of complaints related to baggage issues.
2. ***Number of Scheduled Flights:*** The total number of flights scheduled by the airline.
3. ***Number of Cancelled Flights:*** The total number of flights cancelled by the airline.
4. ***Number of Passengers Enplaned:*** The total number of passengers who boarded the flights.

### Loading the Dataset

In [1]:
import pandas as pd

In [5]:
# Load dataset from CSV file
df = pd.read_csv("baggagecomplaints.csv")
df.head()

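|   | Airline | Date | Month | Year | Baggage | Scheduled | Cancelled | Enplaned |
|---|---------|------|-------|------|---------|-----------|-----------|----------|
| 0 | American Eagle | 01/2004 | 1 | 2004 | 12502 | 38276 | 2481 | 992360 |
| 1 | American Eagle | 02/2004 | 2 | 2004 | 8977 | 35762 | 886 | 1060618 |
| 2 | American Eagle | 03/2004 | 3 | 2004 | 10289 | 39445 | 1346 | 1227469 |
| 3 | American Eagle | 04/2004 | 4 | 2004 | 8095 | 38982 | 755 | 1234451 |
| 4 | American Eagle | 05/2004 | 5 | 2004 | 10618 | 40422 | 2206 | 1267581 |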
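
The four variables listed in the Background section correspond to the `Baggage`, `Scheduled`, `Cancelled` and `Enplaned` columns. As a minimal sketch, the cell below isolates those columns and summarises them. It assumes only the five rows shown in the `df.head()` output, rebuilt in memory as a stand-in for `baggagecomplaints.csv`; the names `sample`, `variables` and `subset` are illustrative and not part of the original notebook.

```python
import io

import pandas as pd

# Stand-in for baggagecomplaints.csv: the five rows shown in df.head(),
# rebuilt in memory so this cell runs without the file (an assumption made
# for illustration only).
sample_csv = io.StringIO(
    "Airline,Date,Month,Year,Baggage,Scheduled,Cancelled,Enplaned\n"
    "American Eagle,01/2004,1,2004,12502,38276,2481,992360\n"
    "American Eagle,02/2004,2,2004,8977,35762,886,1060618\n"
    "American Eagle,03/2004,3,2004,10289,39445,1346,1227469\n"
    "American Eagle,04/2004,4,2004,8095,38982,755,1234451\n"
    "American Eagle,05/2004,5,2004,10618,40422,2206,1267581\n"
)
sample = pd.read_csv(sample_csv)

# The four variables of interest named in the Background section.
variables = ["Baggage", "Scheduled", "Cancelled", "Enplaned"]
subset = sample[variables]

# Count, mean, standard deviation and quartiles for each variable.
print(subset.describe())
```

With the full file loaded, the same selection would be `df[variables]`.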


In [6]:
# Summary of the DataFrame's basic information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252 entries, 0 to 251
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Airline    252 non-null    object
 1   Date       252 non-null    object
 2   Month      252 non-null    int64 
 3   Year       252 non-null    int64 
 4   Baggage    252 non-null    int64 
 5   Scheduled  252 non-null    int64 
 6   Cancelled  252 non-null    int64 
 7   Enplaned   252 non-null    int64 
dtypes: int64(6), object(2)
memory usage: 15.9+ KB
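
The `df.info()` output reports `Date` as an `object` column, meaning the month/year strings are stored as plain text. If time-based operations are needed later, one option is to parse the column with `pd.to_datetime`. This is a sketch and not a step taken in the original notebook: it assumes every value follows the `MM/YYYY` pattern seen in `df.head()`, and the `dates` series is a small stand-in for `df["Date"]`.

```python
import pandas as pd

# Stand-in for df["Date"]: the first three values shown in df.head().
dates = pd.Series(["01/2004", "02/2004", "03/2004"], name="Date")

# Parse month/year strings; each value becomes the first day of that month.
# The explicit format assumes every entry follows the MM/YYYY pattern.
parsed = pd.to_datetime(dates, format="%m/%Y")

print(parsed.dtype)
print(parsed.dt.month.tolist())
```

On the full dataset the equivalent call would be `df["Date"] = pd.to_datetime(df["Date"], format="%m/%Y")`.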
