# Programming for Data Analysis Project

<img src="https://static.voices.com/wp-content/uploads/History-of-Audiobooks.jpg" width="700" height="400">

## 1. Introduction and Project Overview
***
### Objectives of the Project:

For this project you must create a data set by simulating a real-world phenomenon of your choosing. You may pick any phenomenon you wish – you might pick one that is of interest to you in your personal or professional life. Then, rather than collect data related to the phenomenon, you should model and synthesise such data using Python. We suggest you use the numpy.random package for this purpose.
Specifically, in this project you should:
- Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables.
- Investigate the types of variables involved, their likely distributions, and their relationships with each other.
- Synthesise/simulate a data set as closely matching their properties as possible.
- Detail your research and implement the simulation in a Jupyter notebook – the data set itself can simply be displayed in an output cell within the notebook.

For this project I've decided to research statistics on audiobook services, which are becoming more popular every year. As a person who loves both audiobooks and statistics, I was curious to find out things like who listens to audiobooks, how many books the average audiobook listener goes through in a year, and how many new audiobooks are published.

I'll pick a selection of relevant variables, examine their properties, potential data types, and how they relate to each other. I'll then try and work out code to simulate a random dataset based on that information.

First of all, I decided to look at my personal statistics, which are readily available to me from my Audible app. I’ve been a member since April 2016, and in total for that period (until July 2022) I’ve listened to audiobooks for 2 months, 14 days, 15 hours and 16 minutes = 108 734.128 minutes. Well done me! Looking back at the last 5 months, March and April have the highest listening time, both at about 34 hours. On average my listening time is around 30 hours a month. Now it will be interesting to find out how I compare to the average listener :)
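The conversion to minutes above can be reproduced with a small sketch. The app does not say how long it considers a "month" to be, so the value below (one twelfth of a 365.2422-day year) is an assumption, chosen because it recovers the quoted figure. The helper name `listening_minutes` is hypothetical.

```python
# Assumption: a "month" is one twelfth of a 365.2422-day year.
# The app does not document this; it is inferred from the quoted total.
DAYS_PER_MONTH = 365.2422 / 12
MINUTES_PER_DAY = 24 * 60


def listening_minutes(months, days, hours, minutes):
    """Convert a months/days/hours/minutes duration into total minutes."""
    total_days = months * DAYS_PER_MONTH + days
    return total_days * MINUTES_PER_DAY + hours * 60 + minutes


# 2 months, 14 days, 15 hours and 16 minutes of listening
total = listening_minutes(2, 14, 15, 16)
print(round(total, 3))
```

Under that assumption the result agrees with the 108 734.128 minutes quoted above; a different month length (for example a flat 30 days) would give a slightly smaller total.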

## 2. Research and interesting facts
***
Digital audiobooks continue to be the fastest-growing segment in publishing. Not only is the entire publishing industry making more money, but there is also more choice available for customers. In 2019 audiobook sales increased by 16% in the United States and generated over 1.2 billion dollars in revenue, whereas in 2018 they made only 940 million, an increase of 25% from 2017. (https://goodereader.com/blog/audiobooks/audiobook-trends-and-statistics-for-2020)

An Edison Research national survey of American audiobook listeners aged 18 and up found that the average number of audiobooks listened to per year increased to 8.1 in 2020, up from 6.8 in 2019.
The most popular audiobook genre continues to be Mysteries/Thrillers/Suspense. 57% of frequent audiobook listeners are under the age of 45; this is up from 51% in 2019. (https://www.audiopub.org/uploads/pdf/2020-Consumer-Survey-and-2019-Sales-Survey-Press-Release-FINAL.pdf)
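Figures like these can feed directly into a simulation. The following is only a rough sketch of one possible approach, not the method used later in this notebook. It assumes that books per year follow a Poisson distribution with the 2020 survey mean of 8.1, and that each simulated listener is under 45 with probability 0.57 (a figure the survey reports for frequent listeners only). The sample size and variable names are hypothetical.

```python
from numpy.random import default_rng

rng = default_rng(seed=123)

# Hypothetical sample size for illustration
n_listeners = 1000

# Assumption: yearly book counts are Poisson-distributed around the survey mean
books_per_year = rng.poisson(lam=8.1, size=n_listeners)

# Assumption: each listener is under 45 independently with probability 0.57
under_45 = rng.random(n_listeners) < 0.57

print("mean books per year:", books_per_year.mean())
print("share under 45:", under_45.mean())
```

A Poisson distribution is a common first choice for counts, but real listening habits may be more skewed (a few very heavy listeners), so another distribution could fit better.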

How are people in the US listening to audiobooks? Smart speakers such as Amazon Echo, Google Home or Apple HomePod are becoming increasingly popular. A recent poll from the Audio Publishers Association found that 60% of respondents own a smart speaker, and 46% of smart speaker owners have used it to listen to an audiobook, which is up 31% from 2018. Although the automobile is still the number one place where people listen to audiobooks, the home is where audiobooks are played for longer durations. (https://goodereader.com/blog/audiobooks/audiobook-trends-and-statistics-for-2020)

Women (between 30 and 49 years old) recently overtook men as the most active audiobook listeners. Audiobooks aren’t particularly popular with people 65 years and older. This group still prefers ebooks and print books.
Given the number of audiobooks people listen to and their high price compared to ebooks and even print books, it isn’t surprising that the average audiobook listener comes from a higher-income household. In a 2019 survey, 30% of participants with a 75k yearly income said they had listened to at least one audiobook in the previous year. (https://www.statista.com/statistics/299808/audiobook-listening-population-in-the-us-by-household-income/)


## Generating data
***
We begin by importing the necessary Python packages.

In [1]:
# numerical arrays
import numpy as np

# dataframes
import pandas as pd

# plotting
import matplotlib.pyplot as plt

# nicer plotting
import seaborn as sns

# secure random choice from a sequence (not controlled by the seed set below)
from secrets import choice

# regular expressions and string constants
import re
import string

# python standard random library
import random

# NumPy random generator constructor
from numpy.random import default_rng

# Seed value 123 is set for reproducible random data; the generator is stored in rng
rng = default_rng(seed=123)
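
As a quick check of what the seed does, here is a minimal sketch: two generators created with the same seed produce identical draws. The names `rng_a` and `rng_b` are illustrative only. Note that the seed only governs values drawn through the NumPy generator; the standard-library `random` module and `secrets.choice` are separate sources of randomness.

```python
from numpy.random import default_rng

# Two independent generators built from the same seed value
rng_a = default_rng(seed=123)
rng_b = default_rng(seed=123)

# Draw five integers in the range [0, 100) from each generator
draws_a = rng_a.integers(low=0, high=100, size=5)
draws_b = rng_b.integers(low=0, high=100, size=5)

# The two sequences match element by element
same = bool((draws_a == draws_b).all())
print(same)
```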

Let's begin by creating user IDs for our audiobook service.