### Recipe 1: Reading a CSV and Working with DataFrames in Pandas

##### The first step when reading a CSV file and working with dataframes in Pandas is to import the necessary packages:

In [3]:
import pandas as pd
import numpy as np

##### The following command takes a csv file and reads it into a dataframe, i.e. a 2D table in Pandas used for storing and manipulating data:

df = pd.read_csv("file_name")
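As a minimal sketch of what read_csv does, the example below reads CSV text from an in-memory buffer instead of a file on disk. The name `csv_text` and the three rows are a hypothetical stand-in that mimics the column layout of the file used in this recipe.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = """Entity,Code,Year,number_new_cs_phds_by_specialty
Artificial intelligence/machine learning,,2010,161
Artificial intelligence/machine learning,,2011,159
Graphics/visualization,,2010,73
"""

# read_csv accepts a file path or any file-like object
sample_df = pd.read_csv(io.StringIO(csv_text))

# shape is a (rows, columns) tuple
print(sample_df.shape)
```

The empty `Code` fields are read as missing values (NaN), which is why that column shows up as float64 later in this recipe.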

##### Let's read in a CSV file containing the annual number of new computer science PhDs in the United States by specialty.

In [17]:
df = pd.read_csv("number-new-cs-phds-us-by-specialty.csv")

##### The head( ) method displays the first 5 rows of our dataframe.

Note: A parameter can be supplied to head to set the number of rows displayed. For example, df.head(15) displays the first 15 rows of a dataframe.

In [18]:
df.head()

Unnamed: 0,Entity,Code,Year,number_new_cs_phds_by_specialty
0,Artificial intelligence/machine learning,,2010,161
1,Artificial intelligence/machine learning,,2011,159
2,Artificial intelligence/machine learning,,2012,171
3,Artificial intelligence/machine learning,,2013,138
4,Artificial intelligence/machine learning,,2014,144
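The row-count parameter can be tried on a small frame. This is only a sketch: `sample_df` and its shortened column name `count` are hypothetical, and tail( ), which mirrors head( ) from the end of the dataframe, is included for comparison.

```python
import pandas as pd

# Hypothetical five-row frame built from the values shown above
sample_df = pd.DataFrame({
    "Year": [2010, 2011, 2012, 2013, 2014],
    "count": [161, 159, 171, 138, 144],
})

first_two = sample_df.head(2)   # first 2 rows
last_two = sample_df.tail(2)    # last 2 rows
default_rows = sample_df.head() # at most 5 rows by default

print(first_two)
print(last_two)
```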


##### The info( ) method prints a summary of the dataframe, such as the number of rows and columns, the number of non-null values in each column, and the data type of each column.

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 231 entries, 0 to 230
Data columns (total 4 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Entity                           231 non-null    object 
 1   Code                             0 non-null      float64
 2   Year                             231 non-null    int64  
 3   number_new_cs_phds_by_specialty  231 non-null    int64  
dtypes: float64(1), int64(2), object(1)
memory usage: 7.3+ KB
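When the same facts are needed as values rather than printed text, attributes such as shape and dtypes can be used. The sketch below assumes a small hypothetical `sample_df` with an empty `Code` column, like the one in the summary above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with an all-missing Code column
sample_df = pd.DataFrame({
    "Entity": ["Graphics/visualization"] * 3,
    "Code": [np.nan, np.nan, np.nan],
    "Year": [2010, 2011, 2012],
})

n_rows, n_cols = sample_df.shape      # (rows, columns)
column_types = sample_df.dtypes       # data type of each column
missing = sample_df.isna().sum()      # missing values per column

print(n_rows, n_cols)
print(column_types)
print(missing)
```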


##### We can select individual columns of the dataframe using the column name and square brackets.

For example, for our computer science PhDs dataframe we could select the 'Year' column with the following command:

In [13]:
df["Year"]

0      2010
1      2011
2      2012
3      2013
4      2014
       ... 
226    2016
227    2017
228    2018
229    2019
230    2020
Name: Year, Length: 231, dtype: int64
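Several columns can be selected at once by passing a list of column names inside the square brackets. The sketch below uses a hypothetical `sample_df` with a shortened column name `count`; a single name gives a Series, while a list gives a dataframe.

```python
import pandas as pd

# Hypothetical three-row frame
sample_df = pd.DataFrame({
    "Entity": ["Graphics/visualization"] * 3,
    "Year": [2010, 2011, 2012],
    "count": [73, 93, 77],
})

years = sample_df["Year"]               # one name: a Series
subset = sample_df[["Year", "count"]]   # list of names: a DataFrame

print(type(years).__name__)
print(type(subset).__name__)
```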

##### We can select individual rows of the dataframe using either the .loc[ ] or .iloc[ ] indexer.

Both are used with square brackets: loc[ ] takes the label of a row, while iloc[ ] takes the integer position of a row, counting from 0.

The command below accesses the fourth row in the dataframe (integer position 3):

In [27]:
df.iloc[3]

Entity                             Artificial intelligence/machine learning
Code                                                                    NaN
Year                                                                   2013
number_new_cs_phds_by_specialty                                         138
Name: 3, dtype: object

A slice of positions can be supplied to iloc[ ] to access a subset of rows. As with Python lists, the end position is excluded, so 1:5 returns the rows at positions 1 through 4.

In [29]:
df.iloc[1:5]

Unnamed: 0,Entity,Code,Year,number_new_cs_phds_by_specialty
1,Artificial intelligence/machine learning,,2011,159
2,Artificial intelligence/machine learning,,2012,171
3,Artificial intelligence/machine learning,,2013,138
4,Artificial intelligence/machine learning,,2014,144
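The difference between loc and iloc is easiest to see when the row labels are not 0, 1, 2, and so on. The sketch below assumes a hypothetical `sample_df` whose labels start at 10; note that a loc slice includes its end label, while an iloc slice excludes its end position.

```python
import pandas as pd

# Hypothetical frame whose row labels differ from row positions
sample_df = pd.DataFrame(
    {"Year": [2010, 2011, 2012, 2013], "count": [161, 159, 171, 138]},
    index=[10, 11, 12, 13],
)

by_position = sample_df.iloc[0]      # first row, by position
by_label = sample_df.loc[10]         # row labelled 10 (the same row here)

pos_slice = sample_df.iloc[1:3]      # positions 1 and 2 (end excluded)
label_slice = sample_df.loc[11:13]   # labels 11, 12 and 13 (end included)

print(pos_slice)
print(label_slice)
```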


##### We can use boolean expressions to access subsets of our dataframe.

For example, let's say we want to consider only PhDs in "Graphics/visualization":

In [30]:
df[df["Entity"] == "Graphics/visualization"]

Unnamed: 0,Entity,Code,Year,number_new_cs_phds_by_specialty
33,Graphics/visualization,,2010,73
34,Graphics/visualization,,2011,93
35,Graphics/visualization,,2012,77
36,Graphics/visualization,,2013,80
37,Graphics/visualization,,2014,87
38,Graphics/visualization,,2015,82
39,Graphics/visualization,,2016,83
40,Graphics/visualization,,2017,78
41,Graphics/visualization,,2018,84
42,Graphics/visualization,,2019,68
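Boolean conditions can also be combined. As a rough sketch on a hypothetical `sample_df` (with a shortened column name `count`), each condition is wrapped in parentheses and joined with `&` for "and" or `|` for "or".

```python
import pandas as pd

# Hypothetical frame with two specialties
sample_df = pd.DataFrame({
    "Entity": [
        "Artificial intelligence/machine learning",
        "Graphics/visualization",
        "Graphics/visualization",
        "Graphics/visualization",
    ],
    "Year": [2010, 2010, 2011, 2012],
    "count": [161, 73, 93, 77],
})

# Rows that are Graphics/visualization AND from 2011 onwards
mask = (sample_df["Entity"] == "Graphics/visualization") & (sample_df["Year"] >= 2011)
recent_graphics = sample_df[mask]

print(recent_graphics)
```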


##### We can use the value_counts( ) method to count how many times each unique value appears in a given column.

For example, let's find which years are in our dataset, and how many rows are associated with each year.

In [35]:
df.value_counts("Year")

Year
2010    21
2011    21
2012    21
2013    21
2014    21
2015    21
2016    21
2017    21
2018    21
2019    21
2020    21
dtype: int64
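The same counting can be done on a single column selected as a Series. The following sketch, on a hypothetical `sample_df`, also uses nunique( ) to get just the number of distinct values.

```python
import pandas as pd

# Hypothetical frame: three AI/ML rows and two Graphics rows
sample_df = pd.DataFrame({
    "Entity": ["Artificial intelligence/machine learning"] * 3
    + ["Graphics/visualization"] * 2,
    "Year": [2010, 2011, 2012, 2010, 2011],
})

rows_per_entity = sample_df["Entity"].value_counts()  # rows per specialty
n_years = sample_df["Year"].nunique()                 # number of distinct years

print(rows_per_entity)
print(n_years)
```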

##### The sort_values( ) method sorts the rows of a dataframe by the values in the supplied columns.

Note: The default sorting order of sort_values( ) is ascending. To sort values in descending order, supply the following argument to the method: ascending = False

In [42]:
df.sort_values("number_new_cs_phds_by_specialty")

Unnamed: 0,Entity,Code,Year,number_new_cs_phds_by_specialty
13,Computing education,,2012,0
16,Computing education,,2015,0
15,Computing education,,2014,0
14,Computing education,,2013,0
11,Computing education,,2010,0
...,...,...,...,...
225,Total,,2015,1256
230,Total,,2020,1315
224,Total,,2014,1321
226,Total,,2016,1324


This tells us that the **minimum** number of new PhDs, zero in several years, came from the "Computing education" specialty.

In [47]:
df[df["Entity"] != "Total"].sort_values("number_new_cs_phds_by_specialty", ascending = False)

Unnamed: 0,Entity,Code,Year,number_new_cs_phds_by_specialty
9,Artificial intelligence/machine learning,,2019,286
10,Artificial intelligence/machine learning,,2020,277
8,Artificial intelligence/machine learning,,2018,266
6,Artificial intelligence/machine learning,,2016,233
7,Artificial intelligence/machine learning,,2017,218
...,...,...,...,...
15,Computing education,,2014,0
16,Computing education,,2015,0
11,Computing education,,2010,0
12,Computing education,,2011,0


This tells us that the **maximum** number of new PhDs in a single year (286, in 2019) came from the "Artificial intelligence/machine learning" specialty. Note that the sort_values method was called on a subset of the dataframe to exclude the 'Total' rows, which are sums over all the specialties.
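If only the extreme rows are needed, sorting the whole dataframe is not required. One possible alternative, sketched on a hypothetical three-row `sample_df`, is to use idxmax( ) and idxmin( ), which return the row label of the largest and smallest value in a column.

```python
import pandas as pd

col = "number_new_cs_phds_by_specialty"

# Hypothetical frame with one row per entity, values taken from the outputs above
sample_df = pd.DataFrame({
    "Entity": [
        "Artificial intelligence/machine learning",
        "Computing education",
        "Total",
    ],
    "Year": [2019, 2014, 2020],
    col: [286, 0, 1315],
})

# Exclude the 'Total' rows before looking for extremes
specialties = sample_df[sample_df["Entity"] != "Total"]

top_row = specialties.loc[specialties[col].idxmax()]     # row with the largest value
bottom_row = specialties.loc[specialties[col].idxmin()]  # row with the smallest value

print(top_row["Entity"], top_row[col])
print(bottom_row["Entity"], bottom_row[col])
```

When several rows tie for the extreme value, as the zeros for "Computing education" do in the full dataset, idxmax( ) and idxmin( ) return only the first matching label.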