# Text Using Markdown

**If you double click on this cell**, you will see the text change so that all of the formatting is removed. This allows you to edit this block of text. This block of text is written using [Markdown](http://daringfireball.net/projects/markdown/syntax), which is a way to format text using headers, links, italics, and many other options. Hit _shift_ + _enter_ or _shift_ + _return_ on your keyboard to show the formatted text again. This is called "running" the cell, and you can also do it using the run button in the toolbar.

# Code cells

One great advantage of IPython notebooks is that you can show your Python code alongside the results, add comments to the code, or even add blocks of text using Markdown. These notebooks allow you to collaborate with others and share your work. The following cell is a code cell.

In [None]:
# Hit shift + enter or use the run button to run this cell and see the results

print('hello world')

In [None]:
# The last line of every code cell will be displayed by default, 
# even if you don't print it. Run this cell to see how this works.

2 + 2 # The result of this line will not be displayed
3 + 3 # The result of this line will be displayed, because it is the last line of the cell

# Nicely formatted results

IPython notebooks allow you to display nicely formatted results, such as plots and tables, directly in
the notebook. You'll learn how to use the following libraries later on in this course, but for now here's a
preview of what IPython notebook can do.

In [None]:
# If you run this cell, you should see the values displayed as a table.

# Pandas is a software library for data manipulation and analysis. You'll learn to use it later in this course.
import pandas as pd

df = pd.DataFrame({'a': [2, 4, 6, 8], 'b': [1, 3, 5, 7]})
df

In [None]:
# If you run this cell, you should see a scatter plot of the function y = x^2

%matplotlib inline
import matplotlib.pyplot as plt

xs = range(-30, 31)
ys = [x ** 2 for x in xs]

plt.scatter(xs, ys)

# Creating cells 
 
To create a new **code cell**, click "Insert > Insert Cell [Above or Below]". A code cell will automatically be created.

To create a new **markdown cell**, first follow the process above to create a code cell, then change the type from "Code" to "Markdown" using the dropdown next to the run, stop, and restart buttons.

# Re-running cells

If you find a bug in your code, you can always update the cell and re-run it. However, any cells that come afterward won't be automatically updated. Try it out below. First run each of the three cells. The first two don't have any output, but you will be able to tell they've run because a number will appear next to them, for example, "In [5]". The third cell should output the message "Intro to Data Analysis is awesome!"

In [None]:
class_name = "Intro to Data Analysis"

In [None]:
message = class_name + " is awesome!"

In [None]:
message

Once you've run all three cells, try modifying the first one to set `class_name` to your name, rather than "Intro to Data Analysis", so you can print that you are awesome.  Then rerun the first and third cells without rerunning the second.

You should have seen that the third cell still printed "Intro to Data Analysis is awesome!"  That's because you didn't rerun the second cell, so even though the `class_name` variable was updated, the `message` variable was not.  Now try rerunning the second cell, and then the third.

You should have seen the output change to "*your name* is awesome!"  Often, after changing a cell, you'll want to rerun all the cells below it.  You can do that quickly by clicking "Cell > Run All Below".

One final thing to remember: if you shut down the kernel after saving your notebook, the cells' output will still appear as you left it when you open the notebook again. However, the state of the kernel will be reset, so none of your variables, imports, or functions will be defined. If you are actively working on a notebook, re-run your cells to rebuild your working environment and pick up where you left off.
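As a rough illustration of why re-running matters after a restart, the sketch below checks which names a notebook relies on are missing from a namespace. The helper `missing_names` and the two example dictionaries are hypothetical and are not part of IPython; in a live notebook you could pass `globals()` as the namespace instead.

```python
def missing_names(required, namespace):
    """Return the names in `required` that are not defined in `namespace`."""
    return [name for name in required if name not in namespace]


# Hypothetical stand-ins for the kernel's namespace: empty right after a
# restart, populated once the first two cells of the exercise are re-run.
after_restart = {}
after_rerun = {
    'class_name': 'Intro to Data Analysis',
    'message': 'Intro to Data Analysis is awesome!',
}

required = ['class_name', 'message']
print(missing_names(required, after_restart))  # both names are missing
print(missing_names(required, after_rerun))    # nothing is missing
```

The saved output of a cell says nothing about whether its variables still exist, which is why a check like this looks at the namespace rather than at what is displayed.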

In [42]:
import unicodecsv
from datetime import datetime as dt


def read_csv(filename):
    with open(filename, 'rb') as f:
        reader = unicodecsv.DictReader(f)
        return list(reader)


def get_unique_set(sets, key):
    unique_set = set()
    for row in sets:
        unique_set.add(row[key])
    return unique_set


def change_key(sets, okey, nkey):
    # Rename key `okey` to `nkey` in every row, modifying the rows in place.
    for row in sets:
        row[nkey] = row[okey]
        del row[okey]


def parse_date(date):
    if date == '':
        return None
    else:
        return dt.strptime(date, '%Y-%m-%d')


def parse_maybe_int(i):
    if i == '':
        return None
    else:
        return int(i)


enrollments = read_csv('/home/loveshadev/PycharmProjects/Udacity/TestProject/DataAnalysis/enrollments.csv')
daily_engagement = read_csv('/home/loveshadev/PycharmProjects/Udacity/TestProject/DataAnalysis/daily_engagement.csv')
project_submissions = read_csv(
    '/home/loveshadev/PycharmProjects/Udacity/TestProject/DataAnalysis/project_submissions.csv')

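### Exercise: for each of these three tables, find the number of rows in the
### table and the number of unique students in the table. To find the number
### of unique students, you might want to create a set of the account keys in
### each table. The loops below first convert the string fields read from the
### CSV files into dates, integers, floats, and booleans.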
for enrollment in enrollments:
    enrollment['cancel_date'] = parse_date(enrollment['cancel_date'])
    enrollment['days_to_cancel'] = parse_maybe_int(enrollment['days_to_cancel'])
    enrollment['is_canceled'] = enrollment['is_canceled'] == 'True'
    enrollment['is_udacity'] = enrollment['is_udacity'] == 'True'
    enrollment['join_date'] = parse_date(enrollment['join_date'])

for engagement_record in daily_engagement:
    engagement_record['lessons_completed'] = int(float(engagement_record['lessons_completed']))
    engagement_record['num_courses_visited'] = int(float(engagement_record['num_courses_visited']))
    engagement_record['projects_completed'] = int(float(engagement_record['projects_completed']))
    engagement_record['total_minutes_visited'] = float(engagement_record['total_minutes_visited'])
    engagement_record['utc_date'] = parse_date(engagement_record['utc_date'])

for submission in project_submissions:
    submission['completion_date'] = parse_date(submission['completion_date'])
    submission['creation_date'] = parse_date(submission['creation_date'])

enrollment_num_rows = len(enrollments)
unique_set_enrollment = get_unique_set(enrollments, 'account_key')
enrollment_num_unique_students = len(unique_set_enrollment)

engagement_num_rows = len(daily_engagement)
unique_set_engagement = get_unique_set(daily_engagement, 'acct')
engagement_num_unique_students = len(unique_set_engagement)

submission_num_rows = len(project_submissions)
unique_set_submission = get_unique_set(project_submissions, 'account_key')
submission_num_unique_students = len(unique_set_submission)

change_key(daily_engagement, 'acct', 'account_key')

for enrollment in enrollments:
    student = enrollment['account_key']
    if student not in unique_set_engagement:
        print(enrollment)
        break

num_problem_students = 0
for enrollment in enrollments:
    student = enrollment['account_key']
    if (student not in unique_set_engagement and
            enrollment['join_date'] != enrollment['cancel_date']):
        num_problem_students += 1

udacity_test_accounts = set()
for enrollment in enrollments:
    if enrollment['is_udacity']:
        udacity_test_accounts.add(enrollment['account_key'])


def remove_udacity_accounts(data):
    non_udacity_data = []
    for data_point in data:
        if data_point['account_key'] not in udacity_test_accounts:
            non_udacity_data.append(data_point)
    return non_udacity_data

non_udacity_enrollments = remove_udacity_accounts(enrollments)
non_udacity_engagement = remove_udacity_accounts(daily_engagement)
non_udacity_submissions = remove_udacity_accounts(project_submissions)

# Map each paying student's account key to their most recent join date.
paid_students = {}

for enrollment in non_udacity_enrollments:
    if (not enrollment['is_canceled'] or
            enrollment['days_to_cancel'] > 7):
        account_key = enrollment['account_key']
        enrollment_date = enrollment['join_date']
        if (account_key not in paid_students or
                enrollment_date > paid_students[account_key]):
            paid_students[account_key] = enrollment_date

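def within_one_week(join_date, engagement_date):
    # Note: a record dated before join_date yields a negative delta, which
    # also satisfies this check.
    time_delta = engagement_date - join_date
    return time_delta.days < 7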

def remove_free_trial_cancels(data):
    new_data = []
    for data_point in data:
        if data_point['account_key'] in paid_students:
            new_data.append(data_point)
    return new_data

paid_enrollments = remove_free_trial_cancels(non_udacity_enrollments)
paid_engagement = remove_free_trial_cancels(non_udacity_engagement)
paid_submissions = remove_free_trial_cancels(non_udacity_submissions)


paid_engagement_in_first_week = []
for engagement_record in paid_engagement:
    account_key = engagement_record['account_key']
    join_date = paid_students[account_key]
    engagement_record_date = engagement_record['utc_date']
    if within_one_week(join_date, engagement_record_date):
        paid_engagement_in_first_week.append(engagement_record)

print(len(paid_engagement_in_first_week))

{u'status': u'canceled', u'is_udacity': False, u'is_canceled': True, u'join_date': datetime.datetime(2014, 11, 12, 0, 0), u'account_key': u'1219', u'cancel_date': datetime.datetime(2014, 11, 12, 0, 0), u'days_to_cancel': 0}
21508
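To see how the `within_one_week` check from the cell above behaves, the sketch below applies it to a few hand-made dates. The dates are invented for illustration and do not come from the course data.

```python
from datetime import datetime as dt


def within_one_week(join_date, engagement_date):
    # Same check as in the cell above.
    time_delta = engagement_date - join_date
    return time_delta.days < 7


join_date = dt(2015, 1, 10)  # made-up join date

# Three days after joining: inside the first week.
print(within_one_week(join_date, dt(2015, 1, 13)))
# Exactly seven days after joining: outside the first week.
print(within_one_week(join_date, dt(2015, 1, 17)))
# Five days before joining: the delta is negative, so the check still passes.
print(within_one_week(join_date, dt(2015, 1, 5)))
```

If engagement recorded before the join date should be excluded, one option is to also require `time_delta.days >= 0`. Whether that is appropriate depends on how earlier enrollments should be treated in the analysis.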


Questions
1. How long do students take to submit projects?
2. How do students who pass their projects differ from those who don't?
3. How much time do students spend taking classes?
4. How does time spent relate to lessons and projects completed?
5. How does engagement change?
6. How many times do students submit projects?

In [None]:
from collections import defaultdict


def group_data(data, key_name):
    grouped_data = defaultdict(list)
    for data_point in data:
        key = data_point[key_name]
        grouped_data[key].append(data_point)
    return grouped_data

engagement_by_account = group_data(paid_engagement_in_first_week, 'account_key')
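The grouping step can be tried on a tiny hand-made list before running it on the real data. This is a minimal sketch: the three records are invented and only mimic the shape of the engagement rows, and the block imports `defaultdict` from `collections` so that it runs on its own.

```python
from collections import defaultdict


def group_data(data, key_name):
    # Same grouping logic as in the cell above.
    grouped_data = defaultdict(list)
    for data_point in data:
        key = data_point[key_name]
        grouped_data[key].append(data_point)
    return grouped_data


# Made-up records that mimic the shape of the engagement rows.
records = [
    {'account_key': 'a', 'total_minutes_visited': 10.0},
    {'account_key': 'b', 'total_minutes_visited': 5.0},
    {'account_key': 'a', 'total_minutes_visited': 2.5},
]

by_account = group_data(records, 'account_key')
print(sorted(by_account.keys()))  # the distinct account keys
print(len(by_account['a']))       # number of records for account 'a'
```

Each key maps to the list of full records for that account, so later steps can sum or inspect any field without grouping again.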

In [None]:
def sum_grouped_items(grouped_data, field_name):
    summed_data = {}
    for key, data_points in grouped_data.items():
        total = 0
        for data_point in data_points:
            total += data_point[field_name]
        summed_data[key] = total
    return summed_data

total_minutes_by_account = sum_grouped_items(engagement_by_account, 'total_minutes_visited')
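One possible next step toward the question of how much time students spend in class is to summarize the per-account totals. The sketch below uses invented numbers rather than the course data, and `summarize` is a hypothetical helper introduced only for this example.

```python
def sum_grouped_items(grouped_data, field_name):
    # Same summing logic as in the cell above.
    summed_data = {}
    for key, data_points in grouped_data.items():
        total = 0
        for data_point in data_points:
            total += data_point[field_name]
        summed_data[key] = total
    return summed_data


def summarize(values):
    # Hypothetical helper: mean, minimum, and maximum of some numbers.
    values = list(values)
    return {
        'mean': sum(values) / float(len(values)),
        'min': min(values),
        'max': max(values),
    }


# Made-up grouped records, keyed by account.
grouped = {
    'a': [{'total_minutes_visited': 10.0}, {'total_minutes_visited': 2.5}],
    'b': [{'total_minutes_visited': 5.0}],
}

totals = sum_grouped_items(grouped, 'total_minutes_visited')
stats = summarize(totals.values())
print(sorted(totals.items()))
print(stats)
```

On the real data, the same call would take `total_minutes_by_account.values()` as its input.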