**Import pandas into your notebook and read the .csv file:**

In [7]:
import pandas as pd
df = pd.read_csv('salaries_by_college_major.csv')
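
**To try read_csv without the file on disk, the sketch below reads a small CSV string instead. The three rows are the first three majors in the dataset, cut down to three columns; the names csv_text and sample_df are made up for this illustration:**

```python
import io
import pandas as pd

# Cut-down sample in the same layout as the salary file (illustration only).
csv_text = """Undergraduate Major,Starting Median Salary,Group
Accounting,46000.0,Business
Aerospace Engineering,57700.0,STEM
Agriculture,42600.0,Business
"""

# read_csv accepts a file path or any file-like object, such as StringIO.
sample_df = pd.read_csv(io.StringIO(csv_text))
print(sample_df)
```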

**Now take a look at the Pandas dataframe we've just created with .head(). This will show us the first 5 rows of our dataframe:**

In [9]:
df.head()

Unnamed: 0,Undergraduate Major,Starting Median Salary,Mid-Career Median Salary,Mid-Career 10th Percentile Salary,Mid-Career 90th Percentile Salary,Group
0,Accounting,46000.0,77100.0,42200.0,152000.0,Business
1,Aerospace Engineering,57700.0,101000.0,64300.0,161000.0,STEM
2,Agriculture,42600.0,71900.0,36300.0,150000.0,Business
3,Anthropology,36800.0,61500.0,33800.0,138000.0,HASS
4,Architecture,41600.0,76800.0,50600.0,136000.0,Business


**To see the number of rows and columns we can use the shape attribute:**

In [10]:
df.shape

(51, 6)
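
**shape is a plain (rows, columns) tuple, so it can be unpacked like any other tuple. A minimal sketch on a made-up three-row dataframe called sample_df:**

```python
import pandas as pd

# Small stand-in dataframe (illustration only, not the full dataset).
sample_df = pd.DataFrame({
    "Undergraduate Major": ["Accounting", "Aerospace Engineering", "Agriculture"],
    "Starting Median Salary": [46000.0, 57700.0, 42600.0],
})

# Unpack the tuple into two separate variables.
n_rows, n_cols = sample_df.shape
print(f"{n_rows} rows and {n_cols} columns")

# len() on a dataframe gives the row count only.
row_count = len(sample_df)
```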

**We saw that each column had a name. We can access the column names directly with the columns attribute:**

In [11]:
df.columns

Index(['Undergraduate Major', 'Starting Median Salary',
       'Mid-Career Median Salary', 'Mid-Career 10th Percentile Salary',
       'Mid-Career 90th Percentile Salary', 'Group'],
      dtype='object')
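
**The columns attribute is an Index object. As a rough sketch, it can be turned into a regular list or used to check whether a column exists. The dataframe sample_df below is a made-up stand-in:**

```python
import pandas as pd

# Small stand-in dataframe (illustration only).
sample_df = pd.DataFrame({
    "Undergraduate Major": ["Accounting", "Agriculture"],
    "Starting Median Salary": [46000.0, 42600.0],
})

# Convert the Index of column names into a plain Python list.
column_names = list(sample_df.columns)
print(column_names)

# Membership tests work directly on the Index.
has_group = "Group" in sample_df.columns
print(has_group)
```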

**Missing Values and Junk Data**

**In this case, we're going to look for NaN (Not a Number) values in our dataframe:**


In [15]:
df.isna

<bound method DataFrame.isna of                      Undergraduate Major  Starting Median Salary  \
0                             Accounting                 46000.0   
1                  Aerospace Engineering                 57700.0   
2                            Agriculture                 42600.0   
3                           Anthropology                 36800.0   
4                           Architecture                 41600.0   
5                            Art History                 35800.0   
6                                Biology                 38800.0   
7                    Business Management                 43000.0   
8                   Chemical Engineering                 63200.0   
9                              Chemistry                 42600.0   
10                     Civil Engineering                 53900.0   
11                        Communications                 38100.0   
12                  Computer Engineering                 61400.0   
13              

**The output above is cut off, but it already shows the problem: without parentheses, df.isna refers to the method itself instead of calling it. Add the parentheses to get a dataframe of True/False values:**

In [16]:
df.isna()

Unnamed: 0,Undergraduate Major,Starting Median Salary,Mid-Career Median Salary,Mid-Career 10th Percentile Salary,Mid-Career 90th Percentile Salary,Group
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,False,False,False
3,False,False,False,False,False,False
4,False,False,False,False,False,False
5,False,False,False,False,False,False
6,False,False,False,False,False,False
7,False,False,False,False,False,False
8,False,False,False,False,False,False
9,False,False,False,False,False,False

**No missing values in the first rows. Check the end of the dataframe with .tail():**


In [14]:
df.tail()

Unnamed: 0,Undergraduate Major,Starting Median Salary,Mid-Career Median Salary,Mid-Career 10th Percentile Salary,Mid-Career 90th Percentile Salary,Group
46,Psychology,35900.0,60400.0,31600.0,127000.0,HASS
47,Religion,34100.0,52000.0,29700.0,96400.0,HASS
48,Sociology,36500.0,58200.0,30700.0,118000.0,HASS
49,Spanish,34000.0,53100.0,31000.0,96400.0,HASS
50,Source: PayScale Inc.,,,,,
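
**Scanning a table of True/False values by eye gets tedious. One possible shortcut is to count the missing values per column and pull out the affected rows. The sketch below uses a made-up three-row dataframe, sample_df, built from the last rows shown above:**

```python
import numpy as np
import pandas as pd

# Stand-in for the tail of the dataset, including the junk footer row.
sample_df = pd.DataFrame({
    "Undergraduate Major": ["Sociology", "Spanish", "Source: PayScale Inc."],
    "Starting Median Salary": [36500.0, 34000.0, np.nan],
    "Group": ["HASS", "HASS", None],
})

# True counts as 1, so summing the boolean table counts NaNs per column.
missing_per_column = sample_df.isna().sum()
print(missing_per_column)

# Keep only the rows where at least one column is missing.
rows_with_missing = sample_df[sample_df.isna().any(axis=1)]
print(rows_with_missing)
```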


**The last row holds only a source attribution, so every salary column in it is NaN. Delete the rows with missing values using .dropna():**

In [18]:
clean_df = df.dropna()
clean_df.tail()

Unnamed: 0,Undergraduate Major,Starting Median Salary,Mid-Career Median Salary,Mid-Career 10th Percentile Salary,Mid-Career 90th Percentile Salary,Group
45,Political Science,40800.0,78200.0,41200.0,168000.0,HASS
46,Psychology,35900.0,60400.0,31600.0,127000.0,HASS
47,Religion,34100.0,52000.0,29700.0,96400.0,HASS
48,Sociology,36500.0,58200.0,30700.0,118000.0,HASS
49,Spanish,34000.0,53100.0,31000.0,96400.0,HASS
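
**Note that dropna returns a new dataframe and leaves the original untouched, which is why the result is stored in clean_df. A small sketch of that behaviour, using made-up names sample_df and clean_sample:**

```python
import numpy as np
import pandas as pd

# Stand-in for the tail of the dataset, including the junk footer row.
sample_df = pd.DataFrame({
    "Undergraduate Major": ["Sociology", "Spanish", "Source: PayScale Inc."],
    "Starting Median Salary": [36500.0, 34000.0, np.nan],
})

# Drop every row that contains at least one NaN.
clean_sample = sample_df.dropna()

# The original still has all three rows; the copy has two.
print(len(sample_df), len(clean_sample))
```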


**Access a particular column from a data frame**

In [19]:
clean_df['Starting Median Salary']

0     46000.0
1     57700.0
2     42600.0
3     36800.0
4     41600.0
5     35800.0
6     38800.0
7     43000.0
8     63200.0
9     42600.0
10    53900.0
11    38100.0
12    61400.0
13    55900.0
14    53700.0
15    35000.0
16    35900.0
17    50100.0
18    34900.0
19    60900.0
20    38000.0
21    37900.0
22    47900.0
23    39100.0
24    41200.0
25    43500.0
26    35700.0
27    38800.0
28    39200.0
29    37800.0
30    57700.0
31    49100.0
32    36100.0
33    40900.0
34    35600.0
35    49200.0
36    40800.0
37    45400.0
38    57900.0
39    35900.0
40    54200.0
41    39900.0
42    39900.0
43    74300.0
44    50300.0
45    40800.0
46    35900.0
47    34100.0
48    36500.0
49    34000.0
Name: Starting Median Salary, dtype: float64

**To find the highest starting salary we can simply chain the .max() method:**

In [20]:
clean_df['Starting Median Salary'].max()

74300.0

**As with isna, leaving off the parentheses returns the bound method instead of the result:**

In [21]:
clean_df['Starting Median Salary'].max

<bound method Series.max of 0     46000.0
1     57700.0
2     42600.0
3     36800.0
4     41600.0
5     35800.0
6     38800.0
7     43000.0
8     63200.0
9     42600.0
10    53900.0
11    38100.0
12    61400.0
13    55900.0
14    53700.0
15    35000.0
16    35900.0
17    50100.0
18    34900.0
19    60900.0
20    38000.0
21    37900.0
22    47900.0
23    39100.0
24    41200.0
25    43500.0
26    35700.0
27    38800.0
28    39200.0
29    37800.0
30    57700.0
31    49100.0
32    36100.0
33    40900.0
34    35600.0
35    49200.0
36    40800.0
37    45400.0
38    57900.0
39    35900.0
40    54200.0
41    39900.0
42    39900.0
43    74300.0
44    50300.0
45    40800.0
46    35900.0
47    34100.0
48    36500.0
49    34000.0
Name: Starting Median Salary, dtype: float64>

**The highest starting salary is $74,300. But which college major earns this much on average? To look up the name of the major, we need the index of that row. Lucky for us, the .idxmax() method gives us the index of the row with the largest value:**

In [22]:
clean_df['Starting Median Salary'].idxmax()

43
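
**Keep in mind that idxmax returns the index label, not the position. In clean_df the two happen to match. The sketch below uses a made-up Series with the labels 10, 11 and 12 to make the difference visible:**

```python
import pandas as pd

# Made-up labels 10, 11, 12 so that label and position differ.
starting = pd.Series(
    [46000.0, 57700.0, 42600.0],
    index=[10, 11, 12],
    name="Starting Median Salary",
)

highest = starting.max()          # the largest value
highest_label = starting.idxmax() # the index label of that value
highest_pos = starting.argmax()   # the integer position of that value

print(highest, highest_label, highest_pos)
```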

**To see the name of the major that corresponds to that particular row, we can use the .loc (location) property:**

In [23]:
clean_df['Undergraduate Major'].loc[43]

'Physician Assistant'

**Here we are selecting both a column ('Undergraduate Major') and the row at index 43, so we are retrieving the value of a particular cell. Chaining two sets of square brackets gives the same result:**

In [24]:
clean_df['Undergraduate Major'][43]

'Physician Assistant'

**If you don't specify a particular column, you can use the .loc property to retrieve an entire row:**


In [26]:
clean_df.loc[43]

Undergraduate Major                  Physician Assistant
Starting Median Salary                           74300.0
Mid-Career Median Salary                         91700.0
Mid-Career 10th Percentile Salary                66400.0
Mid-Career 90th Percentile Salary               124000.0
Group                                               STEM
Name: 43, dtype: object
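
**.loc also accepts a row label and a column name in a single call, which avoids chaining two lookups. A minimal sketch on a made-up three-row dataframe, sample_df, that keeps the original index labels:**

```python
import pandas as pd

# Three rows from the dataset, keeping their original index labels.
sample_df = pd.DataFrame(
    {
        "Undergraduate Major": ["Accounting", "Aerospace Engineering", "Physician Assistant"],
        "Starting Median Salary": [46000.0, 57700.0, 74300.0],
    },
    index=[0, 1, 43],
)

# Row label first, column name second: returns a single cell.
major = sample_df.loc[43, "Undergraduate Major"]

# Row label only: returns the whole row as a Series.
row = sample_df.loc[43]

print(major)
print(row["Starting Median Salary"])
```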

**Which college major has the highest mid-career salary? How much do graduates with this major earn?**

In [33]:
clean_df["Mid-Career Median Salary"]

0      77100.0
1     101000.0
2      71900.0
3      61500.0
4      76800.0
5      64900.0
6      64800.0
7      72100.0
8     107000.0
9      79900.0
10     90500.0
11     70000.0
12    105000.0
13     95500.0
14     88900.0
15     56300.0
16     56900.0
17     98600.0
18     52000.0
19    103000.0
20     64700.0
21     68500.0
22     88300.0
23     62600.0
24     65500.0
25     79500.0
26     59800.0
27     60600.0
28     71000.0
29     57500.0
30     94700.0
31     74800.0
32     53200.0
33     80900.0
34     66700.0
35     82300.0
36     79600.0
37     92400.0
38     93600.0
39     55000.0
40     67000.0
41     55300.0
42     81200.0
43     91700.0
44     97300.0
45     78200.0
46     60400.0
47     52000.0
48     58200.0
49     53100.0
Name: Mid-Career Median Salary, dtype: float64

In [28]:
clean_df["Mid-Career Median Salary"].max()

107000.0

In [29]:
clean_df["Mid-Career Median Salary"].idxmax()

8

In [34]:
clean_df['Undergraduate Major'].loc[8]

'Chemical Engineering'

In [35]:
clean_df['Undergraduate Major'][8]

'Chemical Engineering'
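
**The separate steps above can also be combined by passing the result of idxmax straight into .loc. This is only one way to do it; the names sample_df, top_idx and top_row are made up, and the three rows are taken from the dataset:**

```python
import pandas as pd

# Rows 6 to 8 of the dataset, keeping their original index labels.
sample_df = pd.DataFrame(
    {
        "Undergraduate Major": ["Biology", "Business Management", "Chemical Engineering"],
        "Mid-Career Median Salary": [64800.0, 72100.0, 107000.0],
    },
    index=[6, 7, 8],
)

# Find the label of the highest mid-career salary, then fetch that row.
top_idx = sample_df["Mid-Career Median Salary"].idxmax()
top_row = sample_df.loc[top_idx]

print(top_row["Undergraduate Major"], top_row["Mid-Career Median Salary"])
```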

**Which college major has the lowest starting salary and how much do graduates earn after university?**

In [36]:
df.head()

Unnamed: 0,Undergraduate Major,Starting Median Salary,Mid-Career Median Salary,Mid-Career 10th Percentile Salary,Mid-Career 90th Percentile Salary,Group
0,Accounting,46000.0,77100.0,42200.0,152000.0,Business
1,Aerospace Engineering,57700.0,101000.0,64300.0,161000.0,STEM
2,Agriculture,42600.0,71900.0,36300.0,150000.0,Business
3,Anthropology,36800.0,61500.0,33800.0,138000.0,HASS
4,Architecture,41600.0,76800.0,50600.0,136000.0,Business


In [37]:
clean_df["Starting Median Salary"].min()

34000.0
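
**One way to answer the question is to pair .idxmin() with .loc, mirroring the idxmax lookup used earlier. The sketch below runs on the last three majors of the dataset; sample_df, lowest_idx and lowest are made-up names:**

```python
import pandas as pd

# Rows 47 to 49 of the dataset, keeping their original index labels.
sample_df = pd.DataFrame(
    {
        "Undergraduate Major": ["Religion", "Sociology", "Spanish"],
        "Starting Median Salary": [34100.0, 36500.0, 34000.0],
        "Mid-Career Median Salary": [52000.0, 58200.0, 53100.0],
    },
    index=[47, 48, 49],
)

# Label of the row with the smallest starting salary.
lowest_idx = sample_df["Starting Median Salary"].idxmin()

# Pull the major and both salary figures for that row.
lowest = sample_df.loc[lowest_idx]
print(lowest["Undergraduate Major"])
print(lowest["Starting Median Salary"], lowest["Mid-Career Median Salary"])
```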