# Preparation for Facebook Technical Questions

Here are my notes for preparing for the technical portion of the interview. To simulate interview conditions, I wrote all code in markdown mode, converting the cells to executable code only after I felt comfortable that they were correct.

## Notes

<a id='pandas-dates'></a> [**Pandas - Time Series / Date Functionality**](http://pandas.pydata.org/pandas-docs/version/0.23/timeseries.html)

* Uses the numpy ```datetime64``` and ```timedelta64``` datatypes

<table border="1" class="docutils">
<colgroup>
<col width="15%">
<col width="27%">
<col width="58%">
</colgroup>

<thead valign="bottom">
<tr class="row-odd"><th class="head">Class</th>
<th class="head">Remarks</th>
<th class="head">How to create</th>
</tr>
</thead><tbody valign="top">
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">Timestamp</span></code></td>
<td>Represents a single timestamp</td>
<td><code class="docutils literal notranslate"><span class="pre">to_datetime</span></code>, <code class="docutils literal notranslate"><span class="pre">Timestamp</span></code></td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">DatetimeIndex</span></code></td>
<td>Index of <code class="docutils literal notranslate"><span class="pre">Timestamp</span></code></td>
<td><code class="docutils literal notranslate"><span class="pre">to_datetime</span></code>, <code class="docutils literal notranslate"><span class="pre">date_range</span></code>, <code class="docutils literal notranslate"><span class="pre">bdate_range</span></code>, <code class="docutils literal notranslate"><span class="pre">DatetimeIndex</span></code></td>
</tr>
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">Period</span></code></td>
<td>Represents a single time span</td>
<td><code class="docutils literal notranslate"><span class="pre">Period</span></code></td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">PeriodIndex</span></code></td>
<td>Index of <code class="docutils literal notranslate"><span class="pre">Period</span></code></td>
<td><code class="docutils literal notranslate"><span class="pre">period_range</span></code>, <code class="docutils literal notranslate"><span class="pre">PeriodIndex</span></code></td>
</tr>
</tbody>
</table>

* Lists of ```Timestamp``` and ```Period``` objects can serve as an index; they are automatically coerced into ```DatetimeIndex``` and ```PeriodIndex``` objects, respectively.
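
As a small sketch of that coercion (the dates and values below are made up for illustration), passing a plain list of ```Timestamp``` or ```Period``` objects as an index should yield the specialized index types:

```python
import pandas as pd

# A list of Timestamp objects used as an index becomes a DatetimeIndex
ts_labels = [pd.Timestamp('2012-05-01'), pd.Timestamp('2012-05-02')]
ts_series = pd.Series([1.0, 2.0], index=ts_labels)
print(type(ts_series.index).__name__)

# A list of Period objects used as an index becomes a PeriodIndex
period_labels = [pd.Period('2012-01', freq='M'), pd.Period('2012-02', freq='M')]
period_series = pd.Series([1.0, 2.0], index=period_labels)
print(type(period_series.index).__name__)
```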

In [1]:
import pandas as pd
# Can convert from strings to date-like objects via pd.to_datetime
pd.to_datetime(pd.Series(['Jul 31, 2009', '2010-01-10', None]))


0   2009-07-31
1   2010-01-10
2          NaT
dtype: datetime64[ns]

In [2]:
# Make range of dates:
pd.date_range('2010-02-20', '2011-03-05')

DatetimeIndex(['2010-02-20', '2010-02-21', '2010-02-22', '2010-02-23',
               '2010-02-24', '2010-02-25', '2010-02-26', '2010-02-27',
               '2010-02-28', '2010-03-01',
               ...
               '2011-02-24', '2011-02-25', '2011-02-26', '2011-02-27',
               '2011-02-28', '2011-03-01', '2011-03-02', '2011-03-03',
               '2011-03-04', '2011-03-05'],
              dtype='datetime64[ns]', length=379, freq='D')
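
The other constructors listed in the table work along the same lines. The following is a minimal sketch with arbitrarily chosen dates, not something taken from the original notes:

```python
import pandas as pd

# Fixed number of daily timestamps starting from a given date
three_days = pd.date_range('2010-02-20', periods=3, freq='D')

# Business days only (weekends are skipped)
business_days = pd.bdate_range('2010-02-20', '2010-02-28')

# Monthly periods rather than points in time
months = pd.period_range('2010-01', '2010-06', freq='M')

print(three_days)
print(business_days)
print(months)
```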

<a id='pandas-join'></a> [**Pandas - Merge, Join, Concatenate**](https://pandas.pydata.org/pandas-docs/stable/merging.html)

In [29]:
# Define example dataframes and series
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                   index=[4, 5, 6, 7])
 

df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                    'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'],
                    'D': ['D8', 'D9', 'D10', 'D11']},
                   index=[8, 9, 10, 11])
s1 = pd.Series(['X0', 'X1', 'X2', 'X3'], name='X')

s2 = pd.Series(['X0', 'X1', 'X2', 'X3'],
               index=['A', 'B', 'C', 'D'])

#### Append

In [28]:
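# Note that a Series appended to a DataFrame is treated as a row (its index
# labels are matched against the column names), not as a column, despite how
# Series are typically displayed.
# DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so
# this cell requires an older pandas.

result = df1.append(s2, ignore_index=True)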
print('{}\n\n{}\n\n{} '.format(df1, s2, result))

    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3

A    X0
B    X1
C    X2
D    X3
dtype: object

    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
4  X0  X1  X2  X3 
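
On pandas 2.0 and later, where ```DataFrame.append``` no longer exists, the same row-wise append can be sketched with ```pd.concat```. The small frames below are stand-ins for ```df1``` and ```s2```, and this is one possible equivalent rather than the only way to do it:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
s2 = pd.Series(['X0', 'X1'], index=['A', 'B'])

# Turn the Series into a one-row DataFrame so its index labels become columns,
# then stack it under df1 and renumber the rows
result = pd.concat([df1, s2.to_frame().T], ignore_index=True)
print(result)
```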


#### Concat

In [33]:
################################################################################
# Notes
#
# - Argument must be a list of DataFrames or Series, or a dict
# - In this example all dataframes have the same columns and non-overlapping
#     indices, so the result is a simple vertical stack
################################################################################
result = pd.concat([df1, df2, df3])
print(result)

      A    B    C    D
0    A0   B0   C0   D0
1    A1   B1   C1   D1
2    A2   B2   C2   D2
3    A3   B3   C3   D3
4    A4   B4   C4   D4
5    A5   B5   C5   D5
6    A6   B6   C6   D6
7    A7   B7   C7   D7
8    A8   B8   C8   D8
9    A9   B9   C9   D9
10  A10  B10  C10  D10
11  A11  B11  C11  D11


In [39]:
################################################################################
# Notes
#
# - Can attach a distinguishing label (key) to each constituent dataframe; the
#     result then has a MultiIndex with the keys as the outer level
# - Concat makes a full copy of the data
################################################################################
result = pd.concat([df1, df2, df3], keys=['x', 'y', 'z'])
print(result)

        A    B    C    D
x 0    A0   B0   C0   D0
  1    A1   B1   C1   D1
  2    A2   B2   C2   D2
  3    A3   B3   C3   D3
y 4    A4   B4   C4   D4
  5    A5   B5   C5   D5
  6    A6   B6   C6   D6
  7    A7   B7   C7   D7
z 8    A8   B8   C8   D8
  9    A9   B9   C9   D9
  10  A10  B10  C10  D10
  11  A11  B11  C11  D11
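
The keys make it easy to pull a constituent dataframe back out of the stacked result. A minimal sketch, using two tiny hypothetical frames in place of ```df1``` and ```df2```:

```python
import pandas as pd

df_a = pd.DataFrame({'A': ['A0', 'A1']}, index=[0, 1])
df_b = pd.DataFrame({'A': ['A2', 'A3']}, index=[2, 3])

stacked = pd.concat([df_a, df_b], keys=['x', 'y'])

# Select every row that came from the second dataframe via the outer level
print(stacked.loc['y'])

# Select a single cell with the full (key, original index) tuple
print(stacked.loc[('y', 3), 'A'])
```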


In [53]:
################################################################################
# Notes
#
# - The default join for concat is 'outer', meaning that it uses the union of
#     the labels on the other axis across all dataframes
# - With an outer join, fields missing from a dataframe are filled with NaN
################################################################################
df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
                    'D': ['D2', 'D3', 'D6', 'D7'],
                    'F': ['F2', 'F3', 'F6', 'F7']},
                   index=[2, 3, 6, 7])
# Concatenate along rows (i.e. vertical stacking)
result = pd.concat([df1, df4], axis=0)
print(result)
print('\n')
# Concatenate along columns (i.e. horizontal stacking)
result = pd.concat([df1, df4], axis=1)
print(result)

     A   B    C   D    F
0   A0  B0   C0  D0  NaN
1   A1  B1   C1  D1  NaN
2   A2  B2   C2  D2  NaN
3   A3  B3   C3  D3  NaN
2  NaN  B2  NaN  D2   F2
3  NaN  B3  NaN  D3   F3
6  NaN  B6  NaN  D6   F6
7  NaN  B7  NaN  D7   F7


     A    B    C    D    B    D    F
0   A0   B0   C0   D0  NaN  NaN  NaN
1   A1   B1   C1   D1  NaN  NaN  NaN
2   A2   B2   C2   D2   B2   D2   F2
3   A3   B3   C3   D3   B3   D3   F3
6  NaN  NaN  NaN  NaN   B6   D6   F6
7  NaN  NaN  NaN  NaN   B7   D7   F7




In [55]:
################################################################################
# Notes
#
# - An inner join returns only the columns (axis=0) or rows (axis=1) that are
#     present in every dataframe, i.e. the intersection
################################################################################
df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
                    'D': ['D2', 'D3', 'D6', 'D7'],
                    'F': ['F2', 'F3', 'F6', 'F7']},
                   index=[2, 3, 6, 7])
# Concatenate along rows (i.e. vertical stacking)
# Drops columns A and C since they are not present in df4, and drops F since
# it is not present in df1
result = pd.concat([df1, df4], join='inner', axis=0)
print(result)
print('\n')
# Concatenate along columns (i.e. horizontal stacking)
# Keeps only the index labels present in both dataframes (2 and 3)
result = pd.concat([df1, df4], join='inner', axis=1)
print(result)

    B   D
0  B0  D0
1  B1  D1
2  B2  D2
3  B3  D3
2  B2  D2
3  B3  D3
6  B6  D6
7  B7  D7


    A   B   C   D   B   D   F
2  A2  B2  C2  D2  B2  D2  F2
3  A3  B3  C3  D3  B3  D3  F3


**Difference between ```append```, ```concat```, ```merge```, and ```join```**

**```append```**
* Solely for appending rows to a dataframe. It is typically slow and seldom used; ```concat``` is preferred. ```DataFrame.append``` was deprecated in pandas 1.4 and removed in pandas 2.0.
* Exists as a dataframe method (i.e. is called via ```df.append```)

**```concat```**
* For stacking dataframes vertically or horizontally.
* Exists in the pandas namespace (i.e. is called via pd.concat)

**```merge```**
* For performing relational-database-style joins on columns or indices (the analogue of a SQL ```JOIN```).

**```join```**
* A shortcut for merging on indices, whereas ```merge``` lets you join on arbitrary columns.
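
To make the ```merge``` versus ```join``` distinction concrete, here is a minimal sketch. The ```left``` and ```right``` frames and the ```key``` column are hypothetical names made up for this example:

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K3'], 'B': ['B0', 'B1', 'B3']})

# merge: join on an arbitrary column
merged = pd.merge(left, right, on='key', how='inner')

# join: join on the index, so the key column is moved into the index first
joined = left.set_index('key').join(right.set_index('key'), how='inner')

print(merged)
print(joined)
```

Both results contain only the keys present in both frames; the difference is that ```merged``` keeps ```key``` as a column while ```joined``` carries it as the index.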


## Exercises

### Mock Question from E-mail

An attendance log for every student in a school district ```attendance_events```:

| date | student_id | attendance |
|:----:|:----------:|:----------:|
|      |            |            |

A summary table with demographics for each student in the district ```all_students```: 

|student_id | school_id | grade_level | date_of_birth | hometown |
|-----------|-----------|-------------|---------------|----------|

Using this data, you could answer questions like the following:

* What percent of students attend school on their birthday?
* Which grade level had the largest drop in attendance between yesterday and today?
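
As a rough sketch of how the second question could be approached with a merge followed by a groupby. The tiny tables below are hypothetical stand-ins for the real ones, and the function name ```largest_attendance_drop``` as well as the definition of "drop" (yesterday's attendance rate minus today's) are assumptions, not part of the original question:

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
all_students = pd.DataFrame({
    'student_id': [1, 2, 3, 4],
    'grade_level': ['Freshman', 'Freshman', 'Senior', 'Senior'],
})
attendance_events = pd.DataFrame({
    'date': pd.to_datetime(['2018-06-14'] * 4 + ['2018-06-15'] * 4),
    'student_id': [1, 2, 3, 4] * 2,
    'attendance': [1, 1, 1, 1, 1, 0, 0, 0],
})

def largest_attendance_drop(attendance_events, all_students, today):
    """Return (grade_level, drop) for the largest fall in attendance rate."""
    today = pd.Timestamp(today)
    yesterday = today - pd.Timedelta(days=1)

    # Attach each student's grade level to the attendance log
    merged = attendance_events.merge(
        all_students[['student_id', 'grade_level']], on='student_id')

    def rate_by_grade(day):
        # Mean of a 0/1 column is the fraction of students present
        rows = merged[merged['date'] == day]
        return rows.groupby('grade_level')['attendance'].mean()

    drop = rate_by_grade(yesterday) - rate_by_grade(today)
    return drop.idxmax(), drop.max()

grade, drop = largest_attendance_drop(attendance_events, all_students, '2018-06-15')
print(grade, drop)
```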

In [6]:
import pandas as pd
import numpy as np

# Functions used to generate mock data simulating the tables given above.

n_students = 1000
n_days = 10
start_date = '2017-09-01'
end_date = '2018-06-15'

################################################################################
# Attendance Table                                                             #
################################################################################
def _make_event_dates(n_students, start_date, end_date):
    dr = pd.date_range(start_date, end_date)
    dates = []
    for day in dr:
        dates.extend([day] * n_students)
    return dates

def _make_student_ids(n_students, n_days=None):
    student_ids = [xx for xx in range(100, 100 + n_students)]
    if n_days is not None:
        student_ids = student_ids * n_days
    return student_ids

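def _make_attendance(n_students, n_days):
    # Draw an independent value for every (student, day) pair; repeating a
    # single day's draw would give each student identical attendance every day
    return list(np.random.choice(2, n_students * n_days, p=[0.3, 0.7]))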

def build_attendance_events(n_students, start_date, end_date):
    columns = ['date', 'student_id', 'attendance']
    n_days = len(pd.date_range(start_date, end_date))
    
    dates       = _make_event_dates(n_students, start_date, end_date)
    student_ids = _make_student_ids(n_students, n_days)
    attendance  = _make_attendance(n_students, n_days)
    data = [xx for xx in zip(dates, student_ids, attendance)]
    
    df = pd.DataFrame(data=data, columns=columns)
    return df

################################################################################
# District All Students Table                                                  #
################################################################################
def _make_school_ids(n_students):
    schools = ['South River High School',
               'New Brunswick High School',
               'East Brunswick High School',
               'Edison High School']
    return list(np.random.choice(schools, n_students))

def _make_grade_levels(n_students):
    grades = ['Freshman', 'Sophomore', 'Junior', 'Senior']
    return list(np.random.choice(grades, n_students))
    
def _make_DOBs(grade_levels):
    birth_years = {
        'Freshman': 2005,
        'Sophomore': 2004,
        'Junior': 2003,
        'Senior': 2002
    }
    years = [birth_years[xx] for xx in grade_levels]
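    months = list(np.random.choice(np.arange(1,13), len(grade_levels)))
    # Days are limited to 1-28 so every (month, day) combination is valid
    days = list(np.random.choice(np.arange(1,29), len(grade_levels)))
    # Pass the format explicitly so month-day-year strings are never misread
    DOBs = pd.to_datetime(['{}-{}-{}'.format(*dd) for dd in zip(months, days, years)],
                          format='%m-%d-%Y')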
    return DOBs

def _make_hometowns(school_ids):
    hometowns = [school.split('High School')[0].strip() for school in school_ids]
    return hometowns
    

def build_all_students(n_students):
    student_ids  = _make_student_ids(n_students)
    school_ids   = _make_school_ids(n_students)
    grade_levels = _make_grade_levels(n_students)
    DOBs         = _make_DOBs(grade_levels)
    hometowns    = _make_hometowns(school_ids)
    
    columns = ['student_id', 'school_id', 'grade_level', 'date_of_birth', 'hometown']
    data = [xx for xx in zip(student_ids, school_ids, grade_levels, DOBs, hometowns)]
    df = pd.DataFrame(data=data, columns=columns)
    return df
    

attendance_events = build_attendance_events(n_students, start_date, end_date)
all_students = build_all_students(n_students=n_students)

In [8]:
attendance_events.head()

        date  student_id  attendance
0 2017-09-01         100           1
1 2017-09-01         101           1
2 2017-09-01         102           1
3 2017-09-01         103           0
4 2017-09-01         104           0


In [9]:
all_students.head()

   student_id                   school_id  grade_level  date_of_birth        hometown
0         100   New Brunswick High School     Freshman     2005-09-12   New Brunswick
1         101   New Brunswick High School    Sophomore     2004-09-23   New Brunswick
2         102     South River High School     Freshman     2005-06-24     South River
3         103  East Brunswick High School       Senior     2002-09-05  East Brunswick
4         104     South River High School     Freshman     2005-12-14     South River


#### What percent of students attend school on their birthday?

In [7]:
attendance_events

        date  student_id  attendance
0 2017-09-01         100           1
1 2017-09-01         101           1
2 2017-09-01         102           1
3 2017-09-01         103           0
4 2017-09-01         104           0
5 2017-09-01         105           0
6 2017-09-01         106           0
7 2017-09-01         107           1
8 2017-09-01         108           1
9 2017-09-01         109           1
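
One possible way to approach this question is sketched below. It is not a definitive answer: it assumes one log row per student per day, counts only students whose birthday falls on a logged date, and uses small hypothetical tables and a made-up function name, ```percent_attending_on_birthday```:

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
all_students = pd.DataFrame({
    'student_id': [100, 101, 102],
    'date_of_birth': pd.to_datetime(['2005-09-12', '2004-09-23', '2005-07-04']),
})
attendance_events = pd.DataFrame({
    'date': pd.to_datetime(['2017-09-12', '2017-09-12', '2017-09-23', '2017-09-23']),
    'student_id': [100, 101, 100, 101],
    'attendance': [1, 1, 1, 0],
})

def percent_attending_on_birthday(attendance_events, all_students):
    # Attach each student's date of birth to every attendance row
    merged = attendance_events.merge(
        all_students[['student_id', 'date_of_birth']],
        on='student_id', how='inner')

    # A row is a birthday row when month and day match (the year differs)
    is_birthday = (
        (merged['date'].dt.month == merged['date_of_birth'].dt.month)
        & (merged['date'].dt.day == merged['date_of_birth'].dt.day)
    )
    birthdays = merged[is_birthday]
    if birthdays.empty:
        return float('nan')

    # Mean of a 0/1 column is the fraction present; scale to a percentage
    return 100.0 * birthdays['attendance'].mean()

pct = percent_attending_on_birthday(attendance_events, all_students)
print(pct)
```

Student 102 has a July birthday that falls outside the logged dates, so only students 100 and 101 enter the calculation here.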


#### Relevant Notes

[Pandas - Dates and Times](#pandas-dates)

[Pandas - Merge, Join, Concatenate](#pandas-join)