
# Project: Investigate a Dataset (No-Show Medical Appointments)

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction


I decided to analyze the no-show appointments dataset. The goal of the analysis is to find out whether patient attendance is related to any of the recorded conditions.

In [None]:
# Import the packages used in this analysis.
# The matplotlib 'magic word' below plots visualizations inline with the notebook. See:
#   http://ipython.readthedocs.io/en/stable/interactive/magics.html
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from datetime import datetime

<a id='wrangling'></a>
## Data Wrangling


I imported the data and made some general observations about it: whether null values are present, whether duplicates are present, the data types, and so on.

### General Properties



In [None]:
# Load the data and print the first row to inspect the columns.
df = pd.read_csv('noshowappointments-kagglev2-may-2016.csv')
df.head(1)

In [None]:
'''
No null records found
'''
df.info() 


In [None]:
'''First look of data'''
df.describe()

In [None]:
'''No duplicates present'''
print(sum(df.duplicated()))
df.drop_duplicates(inplace=True) #no duplicates found
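
The null, duplicate and data type checks above can also be collected in one place. The following is a minimal sketch, not part of the original analysis: `summarize_quality` is a hypothetical helper, and the toy data frame is invented for illustration.

```python
import pandas as pd

def summarize_quality(frame):
    """Return null counts per column, the number of duplicated rows, and dtypes."""
    return {
        "nulls": frame.isnull().sum().to_dict(),
        "duplicates": int(frame.duplicated().sum()),
        "dtypes": frame.dtypes.astype(str).to_dict(),
    }

# Toy data, invented for illustration only (not taken from the real dataset).
toy = pd.DataFrame({
    "Age": [30, 45, 45, None],
    "No-show": ["No", "Yes", "Yes", "No"],
})

summary = summarize_quality(toy)
print(summary["nulls"])       # missing values per column
print(summary["duplicates"])  # rows that repeat an earlier row
print(summary["dtypes"])      # data type of each column
```

Calling `summarize_quality(df)` on the real data frame would return the same three pieces of information that `df.info()` and `df.duplicated()` show separately.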


### Data Cleaning 


<ul>
<li> Convert AppointmentDay into date format.</li>
<li> Convert ScheduledDay into date format.</li>
<li> Check for anomalies in the date entries, i.e. drop records where the scheduled day is after the appointment day.</li>
<li> Rename No-show to the more explanatory AppointmentAttendance and store it as a number, 1 for present and 0 for absent, for ease of analysis.</li>
<li> Save the difference in days between the appointment day and the scheduled day for further EDA.</li>
</ul>
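
The steps listed above can be sketched as a single function. This is only an illustration on invented toy rows: `clean_appointments` is a hypothetical helper name, and the timestamp strings are assumed to look like ISO 8601 values. The cells that follow perform the same steps one at a time on the real data.

```python
import pandas as pd

def clean_appointments(raw):
    """Apply the listed cleaning steps to a copy of `raw` and return the result."""
    out = raw.copy()
    # Parse both day columns and drop the time of day (dtype stays datetime64).
    for col in ("AppointmentDay", "ScheduledDay"):
        out[col] = pd.to_datetime(out[col]).dt.normalize()
    # Days between scheduling and the appointment itself.
    out["DeltaAppointmentScheduled"] = (out["AppointmentDay"] - out["ScheduledDay"]).dt.days
    # Drop anomalies: rows scheduled after the appointment day.
    out = out[out["DeltaAppointmentScheduled"] >= 0].copy()
    # Rename and encode: "No" (no-show = No) means present -> 1, "Yes" -> 0.
    out = out.rename(columns={"No-show": "AppointmentAttendance"})
    out["AppointmentAttendance"] = out["AppointmentAttendance"].map({"No": 1, "Yes": 0})
    return out

# Toy rows, invented for illustration; the second one is scheduled after its appointment.
toy = pd.DataFrame({
    "ScheduledDay": ["2016-04-29T18:38:08Z", "2016-05-10T08:00:00Z", "2016-05-01T09:15:00Z"],
    "AppointmentDay": ["2016-04-29T00:00:00Z", "2016-05-09T00:00:00Z", "2016-05-04T00:00:00Z"],
    "No-show": ["No", "Yes", "Yes"],
})

cleaned = clean_appointments(toy)
print(cleaned[["DeltaAppointmentScheduled", "AppointmentAttendance"]])
```

Working on a copy keeps the raw frame untouched, which makes it easy to compare row counts before and after the anomaly filter.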

In [None]:
'''Convert days from string to date format'''

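# Parse the timestamp strings and drop the time of day.
# normalize() keeps the datetime64 dtype, so the date subtraction and .dt accessor used later work.
df['AppointmentDay'] = pd.to_datetime(df['AppointmentDay']).dt.normalize()
df['ScheduledDay'] = pd.to_datetime(df['ScheduledDay']).dt.normalize()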


In [None]:
'''check if any anomalies are present with date where appointment date is less than scheduled date'''
np.where(df['AppointmentDay'] < df['ScheduledDay'])

In [None]:
'''confirm the above output'''
df.loc[27033]

In [None]:
'''save difference of appointment date and scheduled date in a new column to filter upon'''
df['DeltaAppointmentScheduled'] = (df['AppointmentDay'] - df['ScheduledDay']).dt.days

In [None]:
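'''keep only the records where the date delta is greater than or equal to 0'''
# .copy() gives an independent frame, so later in-place edits do not act on a slice of the original
df = df[df['DeltaAppointmentScheduled'] >= 0].copy()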

In [None]:
'''confirm the number of rows, it should be original raw number - number of anomalies detected in np.where'''
df.info()

In [None]:
'''rename No-Show to appointment attendance'''
df.rename(columns = {'No-show' : 'AppointmentAttendance'} , inplace=True)

In [None]:
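'''replace No with 1 (patient was present) and Yes with 0 (patient was absent)'''
# Assign the mapped column back instead of calling replace(inplace=True) on a column selection
df['AppointmentAttendance'] = df['AppointmentAttendance'].map({"No": 1, "Yes": 0})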

In [None]:
'''have a look at data again'''
print(df.info())
df.head(1)

In [None]:
'''describe the data to see more details'''
df.describe()

<a id='eda'></a>
## Exploratory Data Analysis


### Research Question 1 - What are the overall attendance and absence percentages?

In [None]:
'''The mean of the 0/1 attendance column, times 100, is the percent of all appointments the patient attended'''
present = df["AppointmentAttendance"].mean()*100

In [None]:
'''Simple maths: the absent percent is 100 - present percent'''
absent = 100 - present


print("Percent of patients present:", round(present, 2), "%")
print("Percent of patients NOT present:", round(absent, 2), "%")
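
The same split can be read off in one step with `value_counts(normalize=True)`, which returns the share of each value. The sketch below uses an invented toy column, not the real data, and only shows that the result agrees with the mean-based calculation.

```python
import pandas as pd

# Toy 0/1 attendance column, invented for illustration (1 = present, 0 = absent).
attendance = pd.Series([1, 1, 1, 0], name="AppointmentAttendance")

# normalize=True returns shares instead of counts; multiply by 100 for percentages.
shares = attendance.value_counts(normalize=True) * 100

print("Percent present:", round(shares.loc[1], 2))
print("Percent absent:", round(shares.loc[0], 2))

# The share of 1s is the same number as the mean of the 0/1 column times 100.
print(shares.loc[1] == attendance.mean() * 100)
```

This form generalizes to columns with more than two categories, where the "100 minus the other value" shortcut no longer applies.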

### Research Question 2 - Does the gap between ScheduledDay and AppointmentDay affect patient attendance?

In [None]:
'''Filter present and absent records into separate data frames to be used later for the histograms'''

df_present = df.query("AppointmentAttendance == 1")
df_not_present = df.query("AppointmentAttendance == 0")

In [None]:
'''Check no of rows present in patients present data frame etc'''
df_present.info()

In [None]:
'''check no of rows in not present data frame'''
df_not_present.info()

In [None]:
'''Plot the frequency of the delta between scheduled and appointment date to see the overall distribution; this makes it
very clear that the majority of data points fall within 100 days'''
# A histogram takes a single variable, so plot the delta column directly
df["DeltaAppointmentScheduled"].plot(kind="hist", title="Distribution of days between ScheduledDay and AppointmentDay\n")


The histogram above shows that the distribution is right-skewed: most appointments have only a short gap between ScheduledDay and AppointmentDay. The histogram shows counts only, so the next plots check whether a shorter gap also goes together with a higher probability that the patient does not miss the appointment.

In [None]:
'''Plot the percent of patients present for each day delta from 0 to 99'''

rate = df.groupby('DeltaAppointmentScheduled')['AppointmentAttendance'].mean() * 100
y = rate.loc[0:99]  # select by delta value (label), not by position
x = y.index         # the deltas that actually occur, so x and y always line up
plt.figure(figsize=(12,5))
plt.scatter(x,y);
plt.title("Percent of patients present by days scheduled in advance (first 100 days)");
plt.xlabel("Advance scheduled days");
plt.ylabel("Percent of Patients present");

In the scatter plot above, the correlation is negative: the attendance percentages are higher and more tightly clustered when the appointment is scheduled only a few days in advance, and they become lower and more scattered as the gap grows. Note that the 0-20 day range is also where the number of appointments is highest, so the higher attendance percentage there is backed by the most data.
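
One way to put a number on the direction of this relationship is the Pearson correlation between the day delta and the attendance percentage per delta. The sketch below runs on invented toy data, so the value it prints says nothing about the real dataset; it also weights every delta equally, regardless of how many appointments fall on it.

```python
import pandas as pd

# Toy data, invented for illustration: attendance drops as the delta grows.
toy = pd.DataFrame({
    "DeltaAppointmentScheduled": [0, 0, 0, 0, 5, 5, 5, 5, 30, 30, 30, 30],
    "AppointmentAttendance":     [1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0],
})

# Percent of patients present for each delta value.
rate = toy.groupby("DeltaAppointmentScheduled")["AppointmentAttendance"].mean() * 100

# Pearson correlation between the delta values (the index) and the rates.
corr = rate.index.to_series().corr(rate)

print(rate)
print("Correlation:", round(corr, 3))
```

A negative coefficient matches a downward trend in the scatter plot; it describes association only and does not show that a longer wait causes absences.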

In [None]:
'''Plot histograms of patients present and absent; each bin covers roughly one week (100 days / 14 bins)'''
plt.figure(figsize=(12,5))
# alpha makes both histograms visible where they overlap
plt.hist(df_present['DeltaAppointmentScheduled'], bins=14, range=(0,100), alpha=0.5);
plt.hist(df_not_present['DeltaAppointmentScheduled'], bins=14, range=(0,100), alpha=0.5);
plt.legend(['Patients Present','Patients Missed']);
plt.title('Distribution for 100 days');
plt.xlabel('Days Scheduled in Advance');
plt.ylabel('Appointments');

The histogram above shows the number of appointments that patients missed and the number they attended, by how many days in advance the appointment was scheduled. Both plots follow a similar trend: the distribution of missed appointments and the distribution of attended appointments are both right-skewed.

In [None]:
'''Get the percent of patients present when the difference between appointment and scheduled day falls in each 10-day range'''

# Each range includes its lower bound, so deltas of exactly 10, 20, ... are not skipped
for start in range(0, 100, 10):
    end = start + 9
    in_range = df[df['DeltaAppointmentScheduled'].between(start, end)]
    print(start, "to", end, "days:", in_range['AppointmentAttendance'].mean()*100)
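
The per-range percentages can also be computed in one grouped operation with `pd.cut`. This is a sketch on invented toy data; the `DeltaBin` column name is an assumption introduced here, not part of the original analysis.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "DeltaAppointmentScheduled": [0, 3, 9, 10, 15, 20, 25],
    "AppointmentAttendance":     [1, 1, 0, 1, 0, 0, 0],
})

# Bin edges 0, 10, 20, 30; right=False makes each bin include its lower edge: [0, 10), [10, 20), ...
edges = list(range(0, 40, 10))
toy["DeltaBin"] = pd.cut(toy["DeltaAppointmentScheduled"], bins=edges, right=False)

# Percent of patients present in each bin.
rate_by_bin = toy.groupby("DeltaBin", observed=True)["AppointmentAttendance"].mean() * 100
print(rate_by_bin)
```

With half-open bins every delta lands in exactly one range, so boundary values such as 10 or 20 are counted once and never dropped.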

### Research Question 3 - Is there any relationship between patient attendance and medical conditions (including alcoholism)?