# The Pandas Library

Pandas is an open source library that provides easy-to-use data structures and data analysis tools in Python. The library offers two useful labelled data structures: the Series and the DataFrame.

Ref: http://pandas.pydata.org

## Series

A Series is a one-dimensional labelled array in which each element has an index label. A Series can hold any type of data, such as integers, floats, strings, or arbitrary Python objects, and it can be built from a list, a dict, a scalar value, or an ndarray.

The syntax for creating a series is

```python
s = pd.Series(data, index=index)
```
An example is shown below on how to use the series in a program.

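```python
import pandas as pd
import numpy as np

# Avoid naming the variable `list`, which would shadow Python's built-in list type
numbers = pd.Series([2, 3, 4, 5, 6, 7])
numbers

# Output:
#
# 0    2
# 1    3
# 2    4
# 3    5
# 4    6
# 5    7
# dtype: int64
```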

An element can be accessed using its index label. Elements can also be selected by value, using a boolean condition on the Series.
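
The sketch below illustrates label-based access and value-based selection on a small Series. The name `prices` and its labels are made up for illustration and do not come from the exercises in this notebook.

```python
import pandas as pd

# Hypothetical data: a Series with string labels as its index
prices = pd.Series([1.5, 0.5, 2.0], index=['apple', 'ball', 'cat'])

# Access a single element by its index label
ball_price = prices['ball']

# Select elements by value using a boolean condition
expensive = prices[prices > 1.0]

print(ball_price)
print(expensive)
```

The boolean condition `prices > 1.0` produces a Series of True/False values, and indexing with it keeps only the elements marked True.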

## Exercise

Assign apple, ball, cat, dog, elephant to a variable 'alphabet', and then access the element for ball using its index.
    

In [7]:
import pandas as pd
import numpy as np

alphabet = pd.Series(['apple', 'ball', 'cat', 'dog', 'elephant'])
print(alphabet,'\n')

alphabet[1]

0       apple
1        ball
2         cat
3         dog
4    elephant
dtype: object 



'ball'

### Solution

```python
alphabet = pd.Series(['apple', 'ball', 'cat', 'dog', 'elephant'])
alphabet[1]
```

In [12]:
import pandas as pd
sr = pd.Series([1,2,3], index = [100, 200, 300])
print(sr,'\n')
sr[200]

100    1
200    2
300    3
dtype: int64 



2

In [17]:
import pandas as pd
import numpy as np

df1 = pd.DataFrame([1,2,3,4,5])
df1

   0
0  1
1  2
2  3
3  4
4  5


In [18]:
df2 = pd.DataFrame([1,2,3,4,5], index=[10,20,30,40,50])
df2

    0
10  1
20  2
30  3
40  4
50  5


In [29]:
df3 = pd.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c'], index=[10,20])
print(df3,'\n')
df3.index = df3.b
print(df3,'\n')
df3.info()
df3.describe()

    a  b  c
10  1  2  3
20  4  5  6 

   a  b  c
b         
2  1  2  3
5  4  5  6 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 2 to 5
Data columns (total 3 columns):
a    2 non-null int64
b    2 non-null int64
c    2 non-null int64
dtypes: int64(3)
memory usage: 64.0 bytes


             a        b        c
count      2.0      2.0      2.0
mean       2.5      3.5      4.5
std    2.12132  2.12132  2.12132
min        1.0      2.0      3.0
25%       1.75     2.75     3.75
50%        2.5      3.5      4.5
75%       3.25     4.25     5.25
max        4.0      5.0      6.0


In [31]:
df4 = pd.DataFrame([[1,2.2,'',3],[4,5.5,'hi',6]], columns=['a','b','c','d'], index=[10,20])
print(df4,'\n')
df4.index = df4.b
print(df4,'\n')
df4.info()
df4.describe()

    a    b   c  d
10  1  2.2      3
20  4  5.5  hi  6 

     a    b   c  d
b                 
2.2  1  2.2      3
5.5  4  5.5  hi  6 

<class 'pandas.core.frame.DataFrame'>
Float64Index: 2 entries, 2.2 to 5.5
Data columns (total 4 columns):
a    2 non-null int64
b    2 non-null float64
c    2 non-null object
d    2 non-null int64
dtypes: float64(1), int64(2), object(1)
memory usage: 80.0+ bytes


             a         b        d
count      2.0       2.0      2.0
mean       2.5      3.85      4.5
std    2.12132  2.333452  2.12132
min        1.0       2.2      3.0
25%       1.75     3.025     3.75
50%        2.5      3.85      4.5
75%       3.25     4.675     5.25
max        4.0       5.5      6.0


In [32]:
df5 = pd.DataFrame([[1,2,'3'],[4,5,6],[7,8,9]])
print(df5,'\n')
df5.info()
df5.describe()

   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
0    3 non-null int64
1    3 non-null int64
2    3 non-null object
dtypes: int64(2), object(1)
memory usage: 152.0+ bytes


         0    1
count  3.0  3.0
mean   4.0  5.0
std    3.0  3.0
min    1.0  2.0
25%    2.5  3.5
50%    4.0  5.0
75%    5.5  6.5
max    7.0  8.0


In [33]:
df6 = pd.DataFrame(np.array([[1,2,'3'],[4,5,6],[7,8,9]]))
print(df6,'\n')
df6.info()
df6.describe()

   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
0    3 non-null object
1    3 non-null object
2    3 non-null object
dtypes: object(3)
memory usage: 152.0+ bytes


        0  1  2
count   3  3  3
unique  3  3  3
top     4  2  9
freq    1  1  1


## Dataframes

A DataFrame is defined by pydata.org as a 2-dimensional labeled data structure with columns of potentially different types. It is similar to a spreadsheet, a SQL table, or a dict of Series objects.

<img src="https://s3.amazonaws.com/rfv2/dataframe.png", style="width: 300px;">

To use the pandas library, first import it, and then call the functions it provides:

```python
import pandas as pd
import numpy as np
pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```
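
The description of a dataframe as a dict of Series objects can be made concrete with a small sketch. The column names and values here (`names`, `ages`, `people`) are invented for illustration.

```python
import pandas as pd

# Hypothetical columns: each Series becomes one column of the dataframe
names = pd.Series(['Ann', 'Bob', 'Cy'])
ages = pd.Series([31, 25, 40])

# The dict keys become the column labels
people = pd.DataFrame({'name': names, 'age': ages})

print(people.dtypes)  # each column keeps its own type
print(people.shape)   # (rows, columns)
```

Unlike the NumPy-based example, where every cell shares one type, this dataframe holds a text column next to an integer column.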

### Description

Let us read a csv file into a dataframe object. The pandas read_csv() function is versatile enough to read files directly from URLs.

You can think of dataframes as being as important to a data scientist as knife skills are to a cook. Learning the various dataframe operations is essential to becoming a strong data scientist.

To understand how dataframes work, let us consider the datasets used by FiveThirtyEight.com to analyze the earning economics of college majors and to run a survey about Midwest states.

### College Majors datasets

Article ref: The Economic Guide To Picking A College Major: 
https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/

"Major categories are from Carnevale et al: "What's It Worth?: The Economic Value of College Majors." Georgetown University Center on Education and the Workforce, 2011. http://cew.georgetown.edu/whatsitworth"

Data is from American Community Survey 2010-2012 Public Use Microdata Series from:

http://www.census.gov/acs/www/data_documentation/pums_data/

CSV datasets can be found at: https://github.com/colaberry/538data/tree/master/college-majors

### Midwest Survey datasets

Example context:

"Here’s a somewhat regular argument I get in: Which states make up which regions of the United States? Some of these regions — the West Coast, Mountain States, Southwest and Northeast are pretty clearly defined — but two other regions, the South and the Midwest, are more nebulous.

I’m from New York, and I generally consider anything west of Philadelphia the Midwest. This admittedly unsophisticated designation is frequently criticized by self-avowed Midwesterners. My boss, originally of Michigan, has many opinions about what, precisely, falls into the Midwest. So I decided to find out which states Midwesterners consider to be in their territory.

To get this broad-based view, we asked SurveyMonkey Audience to ask self-identified Midwesterners which states make the cut. We ran a national survey that targeted the Midwest from March 12 to March 17, with 2,778 respondents. Of those, 1,357 respondents identified “a lot” or “some” as a Midwesterner. We then asked this group to identify the states they consider part of the Midwest."

Article ref: Which States Are in the Midwest? https://fivethirtyeight.com/datalab/what-states-are-in-the-midwest/

SOUTH.csv and MIDWEST.csv contain individual responses from surveys about regional identification conducted for FiveThirtyEight by SurveyMonkey. 

CSV datasets can be found at: https://github.com/colaberry/538data/tree/master/region-survey

## Exercise:

### Reading into a Dataframe

Dataframes are part of the pandas library. One versatile feature of pandas is that the read_csv() function can load data directly from a link passed as its argument.

 - One of the most useful and fundamental commands is the 'head' function, which lists the first few rows of the dataframe in top-down order. It takes the number of rows to list as its argument (default is 5).
 - List the first five rows of the grad_students dataframe using the 'head' function and assign the result to the grad_sample variable.

In [34]:
# Import the library
import pandas as pd
import numpy as np

grad_url = "https://raw.githubusercontent.com/colaberry/538data/master/college-majors/grad-students.csv"
grad_students = pd.read_csv(grad_url)

#Use .head() function
#write your code here
grad_sample = grad_students.head()
print(grad_sample)

   Major_code                                   Major  \
0        5601                   CONSTRUCTION SERVICES   
1        6004       COMMERCIAL ART AND GRAPHIC DESIGN   
2        6211                  HOSPITALITY MANAGEMENT   
3        2201  COSMETOLOGY SERVICES AND CULINARY ARTS   
4        2001              COMMUNICATION TECHNOLOGIES   

                        Major_category  Grad_total  Grad_sample_size  \
0  Industrial Arts & Consumer Services        9173               200   
1                                 Arts       53864               882   
2                             Business       24417               437   
3  Industrial Arts & Consumer Services        5411                72   
4              Computers & Mathematics        9109               171   

   Grad_employed  Grad_full_time_year_round  Grad_unemployed  \
0           7098                       6511              681   
1          40492                      29553             2482   
2          18368                

### Solution

```python
grad_sample = grad_students.head()
grad_sample
```

## Using head() function with a parameter

We can pass the number of rows as a parameter to head() function.

### Exercise:

- Now, try listing the first 7 rows of the midwest_survey dataframe using the head() function and assign the result to the midwest_sample variable.

In [35]:
# Import the library
import pandas as pd
import numpy as np

midwest_survey_url = "https://raw.githubusercontent.com/colaberry/538data/master/region-survey/MIDWEST.csv"
midwest_survey = pd.read_csv(midwest_survey_url)

#Use value in .head() function by passing the number of rows.
#write your code below
midwest_sample = midwest_survey.head(7)
print(midwest_sample)

   RespondentID  \
0           NaN   
1  3.126807e+09   
2  3.126802e+09   
3  3.126791e+09   
4  3.126781e+09   
5  3.126780e+09   
6  3.126774e+09   

  In your own words, what would you call the part of the country you live in now?  \
0                                Open-Ended Response                                
1                                           Southern                                
2                                            Midwest                                
3                                            Midwest                                
4                                           Mid-west                                
5                                            Midwest                                
6                                                 US                                

  How much, if at all, do you personally identify as a Midwesterner?  \
0                                              A lot                   
1                     

### Solution

```python
midwest_sample = midwest_survey.head(7)
midwest_sample
```

### Writing a Dataframe into a csv file

To save a dataframe to a csv file, use 'to_csv'.

```python
    df.to_csv('filename.csv')
```    
         
where df is the name of the dataframe and filename.csv is the name you want your csv file to be named.

to_csv accepts many arguments. For example, 'sep' sets the delimiter placed between fields:

```python
    df.to_csv(file_name, sep=',')
```  
Because a dataframe carries an index, the output will include the index values. To keep the index out of the csv file, pass 'False' to the 'index' argument.

```python
    df.to_csv(file_name, encoding='utf-8', index=False)
```
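
To see what the 'index' argument changes without creating a file on disk, the sketch below writes a small, made-up dataframe to in-memory text buffers. Using `io.StringIO` as the target is a choice made for this illustration; the examples above write to a file name instead.

```python
import io
import pandas as pd

# Hypothetical two-column dataframe
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Default behaviour: the index is written as the first, unnamed column
with_index = io.StringIO()
df.to_csv(with_index)

# index=False leaves the index out of the csv text
without_index = io.StringIO()
df.to_csv(without_index, index=False)

print(with_index.getvalue())
print(without_index.getvalue())
```

The unnamed first column in the default output is the reason a csv written with the index and read back with read_csv() gains an extra 'Unnamed: 0' column.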

## Exercise

What is the syntax for writing a dataframe 'grad_students' to a csv file 'gradstudents.csv' without any index values stored in the csv file?


In [37]:
import pandas as pd
import numpy as np
grad_url = "https://raw.githubusercontent.com/colaberry/538data/master/college-majors/grad-students.csv"
grad_students = pd.read_csv(grad_url)

student_doc = 'gradstudents.csv'
grad_students.to_csv(student_doc, encoding='utf-8', index=False)

grad_students2 = pd.read_csv(student_doc)
grad_students2.head()

Unnamed: 0,Major_code,Major,Major_category,Grad_total,Grad_sample_size,Grad_employed,Grad_full_time_year_round,Grad_unemployed,Grad_unemployment_rate,Grad_median,...,Nongrad_total,Nongrad_employed,Nongrad_full_time_year_round,Nongrad_unemployed,Nongrad_unemployment_rate,Nongrad_median,Nongrad_P25,Nongrad_P75,Grad_share,Grad_premium
0,5601,CONSTRUCTION SERVICES,Industrial Arts & Consumer Services,9173,200,7098,6511,681,0.087543,75000.0,...,86062,73607,62435,3928,0.050661,65000.0,47000,98000.0,0.09632,0.153846
1,6004,COMMERCIAL ART AND GRAPHIC DESIGN,Arts,53864,882,40492,29553,2482,0.057756,60000.0,...,461977,347166,250596,25484,0.068386,48000.0,34000,71000.0,0.10442,0.25
2,6211,HOSPITALITY MANAGEMENT,Business,24417,437,18368,14784,1465,0.073867,65000.0,...,179335,145597,113579,7409,0.048423,50000.0,35000,75000.0,0.119837,0.3
3,2201,COSMETOLOGY SERVICES AND CULINARY ARTS,Industrial Arts & Consumer Services,5411,72,3590,2701,316,0.080901,47000.0,...,37575,29738,23249,1661,0.0529,41600.0,29000,60000.0,0.125878,0.129808
4,2001,COMMUNICATION TECHNOLOGIES,Computers & Mathematics,9109,171,7512,5622,466,0.058411,57000.0,...,53819,43163,34231,3389,0.0728,52000.0,36000,78000.0,0.144753,0.096154


### Solution

```python
student_doc = "gradstudents.csv"
grad_students.to_csv(student_doc, encoding='utf-8', index=False)

```

## Categorical & Dummy Variables


<img src="../../../images/types_of_data.png" style="width: 700px;">

Categorical variables are variables that can take a value from a fixed set. For example, the day of the week:

```python
# A set of the seven possible values, written as strings
day = {'Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'}
```
Data scientists frequently encounter columns that hold categorical variables. For example, consider a dataframe that contains data about various purchases, tabulated by the day and the sex of the person who made the purchase. We can create the dataframe from a dictionary of values:

```python
raw_data = {'Day': ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday',
                    'Sunday', 'Monday', 'Tuesday', 'Wednesday'],
            'Sex': ['male', 'female', 'male', 'female', 'female', 'female', 'female',
                    'male', 'male', 'female'],
            'Total_Amount': ['100.0', '30', '40', '70', '90', '20', '50', '30',
                             '60', '70']}

# The column order given here matches the order shown in the output below
checks = pd.DataFrame(raw_data, columns = ['Total_Amount', 'Day', 'Sex'])
```
The checks dataframe now contains:

<pre>
  Total_Amount        Day     Sex
0        100.0     Monday    male
1           30    Tuesday  female
2           40  Wednesday    male
3           70   Thursday  female
4           90     Friday  female
5           20   Saturday  female
6           50     Sunday  female
7           30     Monday    male
8           60    Tuesday    male
9           70  Wednesday  female
</pre>

Certain machine learning algorithms cannot make sense of categorical variables such as Day or Sex. Hence, we need to transform them into numeric codes that represent each unique value. To do so, we can use dummy variables. Dummy variables turn each day of the week into its own column of 1s and 0s:

```python
day_dummies = pd.get_dummies(checks['Day'])
day_dummies
```
<pre>
   Friday  Monday  Saturday  Sunday  Thursday  Tuesday  Wednesday
0       0       1         0       0         0        0          0
1       0       0         0       0         0        1          0
2       0       0         0       0         0        0          1
3       0       0         0       0         1        0          0
4       1       0         0       0         0        0          0
5       0       0         1       0         0        0          0
6       0       0         0       1         0        0          0
7       0       1         0       0         0        0          0
8       0       0         0       0         0        1          0
9       0       0         0       0         0        0          1
</pre>

You can see that there are as many columns as there are unique values in the categorical variable. Also, each category, such as Monday or Tuesday, has its own unique pattern of 1s and 0s, with exactly one 1 in every row.
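
A small sketch of the same transformation on made-up data follows. The `dtype=int` argument is an addition made here: recent pandas versions return True/False columns by default, and passing `dtype=int` requests the 0/1 form shown above.

```python
import pandas as pd

# Hypothetical categorical column with two unique values
days = pd.Series(['Monday', 'Tuesday', 'Monday'])

# One column per unique value; dtype=int asks for 0/1 instead of True/False
day_dummies = pd.get_dummies(days, dtype=int)
print(day_dummies)

# Every row contains exactly one 1
print(day_dummies.sum(axis=1))
```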

<br/>

### Exercise:

 - Transform the 'Sex' column into dummy variables and append them to the dataframe, checks.

In [41]:
raw_data = {'Day': ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday','Sunday', 'Monday', 'Tuesday', 'Wednesday'],
                      'Sex': ['male', 'female', 'male', 'female', 'female', 'female', 'female', 'male', 'male', 'female'],
                      'Total_Amount': ['100.0', '30', '40', '70', '90', '20', '50', '30','60', '70']}
checks = pd.DataFrame(raw_data, columns = ['Total_Amount', 'Day', 'Sex'])

# Use pd.concat([<dataframe>, <sex dataframe>], axis=1) and display the dataframe.
# Add code to create new dummy variables out of the Sex column and append them to the checks dataframe.

sex_dummies = pd.get_dummies(checks['Sex'])
print(sex_dummies)
pd.concat((checks, sex_dummies), axis=1)

   female  male
0       0     1
1       1     0
2       0     1
3       1     0
4       1     0
5       1     0
6       1     0
7       0     1
8       0     1
9       1     0


  Total_Amount        Day     Sex  female  male
0        100.0     Monday    male       0     1
1         30.0    Tuesday  female       1     0
2         40.0  Wednesday    male       0     1
3         70.0   Thursday  female       1     0
4         90.0     Friday  female       1     0
5         20.0   Saturday  female       1     0
6         50.0     Sunday  female       1     0
7         30.0     Monday    male       0     1
8         60.0    Tuesday    male       0     1
9         70.0  Wednesday  female       1     0


### Solution

```python
sex_dummies = pd.get_dummies(checks['Sex'])
checks = pd.concat((checks, sex_dummies), axis=1)
```

## Indexing

Indexing is done using labels, which are the names of the columns, or using numeric ranges. To select a single column by its label:

```python
    grad_students['Major']
```

We can also pass a list of column names to select those columns from the dataset.

```python
    grad_students[['Major', 'Grad_total']]
```


## Slicing

Slicing is done to pull out a set of rows from the dataset.

```python
    grad_students[2:5]
```

To slice out a set of rows, you use the following syntax: data[start:stop].

We can select specific ranges of our data in both the row and column directions using either label-based or integer-based indexing.

 - loc: indexing via labels (the labels themselves may be integers); both ends of a slice are included
 - iloc: indexing via integer positions; the stop position of a slice is excluded
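
The difference between loc and iloc is easiest to see on a dataframe whose index labels are not 0, 1, 2. The sketch below uses a made-up dataframe with labels 10, 20, 30.

```python
import pandas as pd

# Hypothetical dataframe with non-default integer labels
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=[10, 20, 30])

# loc works on labels: the slice 10:20 includes both end points
by_label = df.loc[10:20, 'a']

# iloc works on positions: the slice 0:2 excludes position 2
by_position = df.iloc[0:2, 0]

print(by_label)
print(by_position)
```

Both selections return the same two rows here, but they are addressed differently: by the labels 10 and 20 in the first case, and by the positions 0 and 1 in the second.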

### Slicing Data based on Criteria
We can also select a subset of our data using criteria. For example, we can select all rows that have a Grad_total value of at most 10000.

```python
    grad_students[grad_students.Grad_total <= 10000]
```

## Exercise

Select the rows of the data where Grad_unemployed is greater than or equal to 700, store them in a variable 'unempl_num', and print it.


In [1]:
import pandas as pd
import numpy as np
grad_url = "https://raw.githubusercontent.com/colaberry/538data/master/college-majors/grad-students.csv"
grad_students = pd.read_csv(grad_url)

unempl_num = grad_students[grad_students.Grad_unemployed >= 700]
print(type(unempl_num))
unempl_num

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,Major_code,Major,Major_category,Grad_total,Grad_sample_size,Grad_employed,Grad_full_time_year_round,Grad_unemployed,Grad_unemployment_rate,Grad_median,...,Nongrad_total,Nongrad_employed,Nongrad_full_time_year_round,Nongrad_unemployed,Nongrad_unemployment_rate,Nongrad_median,Nongrad_P25,Nongrad_P75,Grad_share,Grad_premium
1,6004,COMMERCIAL ART AND GRAPHIC DESIGN,Arts,53864,882,40492,29553,2482,0.057756,60000.0,...,461977,347166,250596,25484,0.068386,48000.0,34000,71000.0,0.104420,0.250000
2,6211,HOSPITALITY MANAGEMENT,Business,24417,437,18368,14784,1465,0.073867,65000.0,...,179335,145597,113579,7409,0.048423,50000.0,35000,75000.0,0.119837,0.300000
6,6206,MARKETING AND MARKETING RESEARCH,Business,190996,3738,151570,123045,8324,0.052059,80000.0,...,1029181,817906,662346,45519,0.052719,60000.0,40000,91500.0,0.156531,0.333333
9,1904,ADVERTISING AND PUBLIC RELATIONS,Communications & Journalism,33928,688,28517,22523,899,0.030562,60000.0,...,163435,127832,100330,8706,0.063762,51000.0,37800,78000.0,0.171907,0.176471
10,6005,FILM VIDEO AND PHOTOGRAPHIC ARTS,Arts,24525,370,19059,13301,2035,0.096473,57000.0,...,116158,93915,63674,8160,0.079941,50000.0,32000,75000.0,0.174328,0.140000
13,1903,MASS MEDIA,Communications & Journalism,42915,828,35939,28054,1957,0.051641,57000.0,...,190020,153722,117581,12816,0.076955,50000.0,35000,72000.0,0.184236,0.140000
14,5901,TRANSPORTATION SCIENCES AND TECHNOLOGIES,Industrial Arts & Consumer Services,27410,538,20035,18088,980,0.046633,90000.0,...,121260,94538,80650,4326,0.043757,69000.0,45000,100000.0,0.184368,0.304348
15,2107,COMPUTER NETWORKING AND TELECOMMUNICATIONS,Computers & Mathematics,11165,218,9037,7988,803,0.081606,80000.0,...,48776,41552,34402,2476,0.056237,58000.0,37700,84000.0,0.186266,0.379310
16,6299,MISCELLANEOUS BUSINESS & MEDICAL ADMINISTRATION,Business,22553,408,17691,14807,865,0.046616,75000.0,...,95860,72005,58441,3694,0.048799,55000.0,38000,85000.0,0.190461,0.363636
20,5301,CRIMINAL JUSTICE AND FIRE PROTECTION,Law & Public Policy,188228,3794,154146,133549,6591,0.041005,68000.0,...,695361,565748,490943,29133,0.048973,52000.0,36000,75000.0,0.213027,0.307692


### Solution


```python
unempl_num = grad_students[grad_students.Grad_unemployed >= 700]
unempl_num
```



<br/><br/><br/>
## Row Selection

As the head() output showed, each row is listed with its index. Dataframes are indexed starting from 0 on both rows and columns, similar to array indexing. Note that the upper bound of a slice is excluded from the output. To list rows 2 to 6 of the column 'Major':

```python
grad_students['Major'][1:6]
```
<img src="https://s3.amazonaws.com/rfv2/row_slice.png", style="width: 350px;">

<br/>
## Exercise:

 - Load rows 2 to 6 of column Major_category and assign it to variable major_category.
 - print out the variable major_category

In [2]:
# list rows 2:6 of Major_category
#note that the index starts with 0
#write your code below

major_category = grad_students['Major_category'][1:6]
print(major_category)

1                                   Arts
2                               Business
3    Industrial Arts & Consumer Services
4                Computers & Mathematics
5                    Law & Public Policy
Name: Major_category, dtype: object


### Solution

```python
major_category = grad_students['Major_category'][1:6]
major_category
```



<br/><br/><br/>
## Column Selection

To list multiple columns use double square brackets. For example, to list Major_category, Grad_share you can use:
```python
grad_students[['Major_category', 'Grad_share']]
```

<img src="https://s3.amazonaws.com/rfv2/col_slice.png", style="width: 350px;">

<br/>

### Exercise:

From the Graduate Students dataset (stored in grad_students):
 - List columns of Major_code and Grad_employed for rows 4 to 8 and assign it to the variable, grad_students_sample.
 - print out grad_students_sample

In [3]:
grad_students_sample = grad_students[['Major_code', 'Grad_employed']][3:8]

grad_students_sample

   Major_code  Grad_employed
3        2201           3590
4        2001           7512
5        3201           1008
6        6206         151570
7        1101          13104


### Solution
```python
grad_students_sample = grad_students[['Major_code', 'Grad_employed']][3:8]
grad_students_sample
```



<br/><br/><br/>
## Assignments & Comparisons

### Adding a new column

Adding a new column with values is as easy as a variable assignment. To add a new column of percentage of employed grads:
```python
grad_students['emp_percent'] = (grad_students.Grad_employed/grad_students.Grad_total) * 100.0
```

### Renaming a Column

Renaming a column is easy by specifying a dictionary of old name as key and new name as value.
```python
grad_students = grad_students.rename(columns={'Major_category': 'Major_Category'})
```

### Comparisons

To determine those rows where sample size is greater than 200,
```python
grad_students[grad_students.Grad_sample_size > 200]
```
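
The three operations above can be tried together on a tiny stand-in dataframe. The values below are invented, and `toy`, `renamed`, and `large_samples` are names chosen for this sketch.

```python
import pandas as pd

# Hypothetical stand-in for a few grad_students columns
toy = pd.DataFrame({'Grad_total': [100, 200, 400],
                    'Grad_employed': [80, 150, 380],
                    'Grad_sample_size': [150, 250, 300]})

# Adding a column is a plain assignment
toy['emp_percent'] = toy.Grad_employed / toy.Grad_total * 100.0

# rename returns a new dataframe; toy keeps its original column names
renamed = toy.rename(columns={'Grad_total': 'Total'})

# A comparison yields True/False per row, which then filters the rows
large_samples = toy[toy.Grad_sample_size > 200]

print(renamed.columns.tolist())
print(large_samples)
```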

<br/>
## Exercise:

List the Major Category of all the data where the employment percentage is greater than 80%.

 - Assign it to the variable major_emp_data and print it out.

In [4]:
grad_students['emp_percent'] = (grad_students.Grad_employed/grad_students.Grad_total) * 100.0
grad_students[grad_students.emp_percent > 80]

#use grad_students.Major_category
#Write your code below:
major_emp_data = grad_students[grad_students.emp_percent > 80].Major_category
major_emp_data

4                  Computers & Mathematics
8                  Computers & Mathematics
9              Communications & Journalism
13             Communications & Journalism
15                 Computers & Mathematics
17                             Engineering
20                     Law & Public Policy
23                                Business
24                 Computers & Mathematics
25                                Business
27                                  Health
28                 Computers & Mathematics
30                                Business
34                             Engineering
38                 Computers & Mathematics
40                       Interdisciplinary
41                       Physical Sciences
45                Psychology & Social Work
46                                    Arts
50     Industrial Arts & Consumer Services
51                                Business
60                 Computers & Mathematics
63                  Biology & Life Science
66         

### Solution

```python
major_emp_data = grad_students.Major_category[grad_students.emp_percent > 80]
major_emp_data
```



<br/><br/><br/>
## Statistical Computations

### Statistical Measures

Statistics such as the mean and median can be easily computed on selected values in the dataframe.

To compute mean on Grad_employed:
```python
grad_students.Grad_employed.mean()
```
To calculate the sum of the Grad_employed column:
```python
grad_students.Grad_employed.sum()
```
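
As a quick check of what these methods return, here is a sketch on a made-up Series (`employed` is a hypothetical name and the numbers are invented):

```python
import pandas as pd

# Hypothetical values standing in for a numeric column
employed = pd.Series([10, 20, 60])

average = employed.mean()    # arithmetic mean of the values
middle = employed.median()   # middle value when sorted
total = employed.sum()       # sum of all values

print(average, middle, total)
```

The mean (30) is pulled above the median (20) by the single large value, which is why both are worth looking at on skewed columns.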

### Sparsity or Missing Values

<img src="../../../images/python_missing_values.png", style="width: 700px;">

In many real world datasets, it is common to encounter missing or null values. 

To find missing or null values in a column, you can use the isnull() function. For this, let us study the midwest survey dataset.
```python
midwest_survey['Which of the following states do you consider part of the Midwest? Please select all that apply. '].isnull()
```
To drop the entries where the above column value is null (this returns a new Series and leaves midwest_survey unchanged):
```python
midwest_survey['Which of the following states do you consider part of the Midwest? Please select all that apply. '].dropna()
```

### Sparsity Substitution

When only a few values are missing, it is not worth throwing away the entire row, as we would lose a lot of useful information. We can fill the missing values in such columns with 'NA' and preserve the row for later analysis:
```python
midwest_survey['Which of the following states do you consider part of the Midwest? Please select all that apply. '].fillna(value='NA')
```
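The above statement does not change the original dataframe. One way to change it is to specify inplace=True, although in recent pandas versions calling fillna with inplace=True on a selected column may not update the parent dataframe, so assigning the result back to the column is the more reliable option: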
```python
midwest_survey['Which of the following states do you consider part of the Midwest? Please select all that apply. '].fillna(value='NA', inplace=True)
```
We can also substitute with another value:
```python
midwest_survey['Which of the following states do you consider part of the Midwest? Please select all that apply. '].fillna(value=0.5)
```
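
The sketch below runs the same steps on a small invented dataframe (`survey`, with made-up 'state' and 'score' columns). It assigns the filled column back to the dataframe instead of using inplace=True, which is an assumption about what works most reliably across pandas versions, not the only way to do it.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data with one missing value in each column
survey = pd.DataFrame({'state': ['Ohio', np.nan, 'Iowa'],
                       'score': [1.0, 2.0, np.nan]})

# True where a value is missing
missing_mask = survey['state'].isnull()

# Drop whole rows of the dataframe where 'state' is missing
complete_rows = survey.dropna(subset=['state'])

# Fill the gaps and store the result back in the dataframe
survey['state'] = survey['state'].fillna('NA')

print(missing_mask)
print(complete_rows)
print(survey)
```

Note the difference between calling dropna() on a single column, which only shortens that column, and calling it on the dataframe with `subset`, which removes the matching rows.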
To see statistical data of an entire dataframe:

```python
grad_students.describe(include='all')
```

<br/>
## Exercise:

Determine the Major for which the Grad_employed is maximum.

 - Assign it to variable grad_emp_max
 - print it out.

In [5]:
# Use grad_students.Grad_employed.max()
#Use == comparator and find out which row/s contains the maximum value. Then print out the Major.
#write your code below:

grad_emp_max = grad_students[grad_students.Grad_employed == grad_students.Grad_employed.max()].Major
print(grad_emp_max)

119    PSYCHOLOGY
Name: Major, dtype: object


### Solution

```python
grad_emp_max = grad_students[grad_students.Grad_employed == grad_students.Grad_employed.max()].Major
grad_emp_max
```




<br/><br/><br/>
## Advanced operations - Dataframe

Dataframes support combining several conditions to filter data. To retrieve data about grad students whose major category is Engineering and whose employed percentage is above 80:
```python
grad_students[(grad_students.Major_category == 'Engineering') & (grad_students.emp_percent > 80)]
```
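Similarly, you can express OR with the single symbol |. Note that pandas uses the single symbols & and | for these comparisons, not the doubled && and || operators found in languages such as C. Because & and | bind more tightly than comparison operators such as == and >, each condition must be wrapped in its own parentheses.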
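
When & and | appear in the same expression, grouping matters, because & is evaluated before |. The following sketch uses an invented dataframe (`df` with 'kind' and 'rate' columns) to show how the result changes.

```python
import pandas as pd

# Hypothetical data: three kinds, one of them with a high rate
df = pd.DataFrame({'kind': ['A', 'B', 'C'], 'rate': [0.5, 0.05, 0.05]})

# & is evaluated before |, so this reads as: A or (B and low rate)
unparenthesized = df[(df['kind'] == 'A') | (df['kind'] == 'B') & (df['rate'] < 0.1)]

# Explicit grouping: (A or B) and low rate
parenthesized = df[((df['kind'] == 'A') | (df['kind'] == 'B')) & (df['rate'] < 0.1)]

print(unparenthesized)
print(parenthesized)
```

The first filter keeps row A despite its high rate, while the second applies the rate condition to both kinds and keeps only row B.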


## Exercise:

List the grad students whose Major is "Commercial Art and Graphic Design" or "Engineering" and whose unemployment rate is below 10%.

 - Assign the result to variable, grad_select_list
 - Print the value

In [6]:
# Use | and & to build your command

#write your code below:
grad_select_list = grad_students[((grad_students.Major == 'COMMERCIAL ART AND GRAPHIC DESIGN') | (grad_students.Major == 'ENGINEERING')) & (grad_students.Grad_unemployment_rate < 0.1)]

grad_select_list

Unnamed: 0,Major_code,Major,Major_category,Grad_total,Grad_sample_size,Grad_employed,Grad_full_time_year_round,Grad_unemployed,Grad_unemployment_rate,Grad_median,...,Nongrad_employed,Nongrad_full_time_year_round,Nongrad_unemployed,Nongrad_unemployment_rate,Nongrad_median,Nongrad_P25,Nongrad_P75,Grad_share,Grad_premium,emp_percent
1,6004,COMMERCIAL ART AND GRAPHIC DESIGN,Arts,53864,882,40492,29553,2482,0.057756,60000.0,...,347166,250596,25484,0.068386,48000.0,34000,71000.0,0.10442,0.25,75.174514


### Solution

```python
grad_select_list = grad_students[((grad_students.Major == 'ENGINEERING') | (grad_students.Major == 'COMMERCIAL ART AND GRAPHIC DESIGN')) & (grad_students.Grad_unemployment_rate < 0.1)]
grad_select_list
```

In [57]:
grad_students[grad_students.Major_category.str.contains('Science')]

Unnamed: 0,Major_code,Major,Major_category,Grad_total,Grad_sample_size,Grad_employed,Grad_full_time_year_round,Grad_unemployed,Grad_unemployment_rate,Grad_median,...,Nongrad_full_time_year_round,Nongrad_unemployed,Nongrad_unemployment_rate,Nongrad_median,Nongrad_P25,Nongrad_P75,Grad_share,Grad_premium,emp_percent,Grad_total_range
22,5503,CRIMINOLOGY,Social Science,18499,381,14584,12074,692,0.0453,65000.0,...,45563,3189,0.056335,50000.0,35000,75000.0,0.216158,0.3,78.836694,10000 to 50000
41,5102,"NUCLEAR, INDUSTRIAL RADIOLOGY, AND BIOLOGICAL ...",Physical Sciences,3910,65,3129,2565,28,0.008869,80000.0,...,6701,605,0.065026,64000.0,45000,80000.0,0.261032,0.25,80.025575,0 to 10000
63,1301,ENVIRONMENTAL SCIENCE,Biology & Life Science,43612,925,38068,29568,1392,0.035276,68000.0,...,61721,3837,0.046416,55000.0,40000,76000.0,0.315764,0.236364,87.287902,10000 to 50000
64,5504,GEOGRAPHY,Social Science,52147,1008,38935,30602,1251,0.03113,73000.0,...,60831,5680,0.06727,55000.0,40000,80000.0,0.324086,0.327273,74.663931,50001 to 100000
66,3604,ECOLOGY,Biology & Life Science,21535,465,17793,13713,677,0.036654,62000.0,...,24530,1665,0.047307,48700.0,33000,74000.0,0.341473,0.273101,82.623636,10000 to 50000
67,4007,INTERDISCIPLINARY SOCIAL SCIENCES,Social Science,30233,541,22755,17138,1444,0.059672,66000.0,...,29613,2493,0.058323,46000.0,34000,70000.0,0.344073,0.434783,75.265438,10000 to 50000
72,5098,MULTI-DISCIPLINARY OR GENERAL SCIENCE,Physical Sciences,223467,3494,162694,126999,4705,0.028106,86000.0,...,229673,13226,0.043517,59000.0,40000,86000.0,0.357429,0.457627,72.804486,Above 100000
77,5507,SOCIOLOGY,Social Science,368710,6155,259742,191949,11353,0.041878,64000.0,...,318869,28396,0.062997,49000.0,35000,71000.0,0.370289,0.306122,70.44615,Above 100000
78,5500,GENERAL SOCIAL SCIENCES,Social Science,73350,1069,45744,34005,2841,0.058475,69000.0,...,55727,5499,0.066579,50000.0,35000,75000.0,0.373895,0.38,62.364008,50001 to 100000
95,5501,ECONOMICS,Social Science,517614,9822,386002,315920,17668,0.043768,100000.0,...,409200,29246,0.055633,70000.0,44000,111000.0,0.424883,0.428571,74.573331,Above 100000




<br/><br/><br/>
## Index Based Selection

<img src="../../../images/python_index_based_selection.png" style="width: 700px;">
Index positions can be used just as in arrays or lists when the selection is a contiguous range rather than multiple disconnected rows or columns. Because iloc counts from 0 and excludes the end of a slice, extracting rows 1 to 4 and columns 4 to 7 looks like this:
```python
grad_students.iloc[0:4, 3:7]
```
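
As a minimal sketch of how positional slicing behaves, the example below builds a small made-up dataframe (the name `demo` and its values are assumptions for illustration, not part of the grad_students data) and selects a contiguous block with iloc. The end of each slice is excluded, as with Python lists.

```python
import pandas as pd

# Hypothetical 5x4 frame used only to illustrate positional slicing;
# each cell holds row_position * 10 + column_position
demo = pd.DataFrame(
    [[r * 10 + c for c in range(4)] for r in range(5)],
    columns=['a', 'b', 'c', 'd'],
)

# Rows at positions 0-2 and columns at positions 1-2 (slice ends are excluded)
block = demo.iloc[0:3, 1:3]
print(block)

# A bare colon selects everything along that axis
all_rows = demo.iloc[:, 1:3]
print(all_rows.shape)
```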

## Exercise:

To select all the data in a row or column, you don't need to specify the start and end index numbers; a colon alone is enough.

 - Filter all rows for columns 2 to 6 and assign the result to the variable grad_row_sample.
 - Print out the first 5 rows.

In [7]:
#Use : in the column space
# Write your code below

grad_row_sample = grad_students.iloc[:,1:6]
grad_row_sample.head()

Unnamed: 0,Major,Major_category,Grad_total,Grad_sample_size,Grad_employed
0,CONSTRUCTION SERVICES,Industrial Arts & Consumer Services,9173,200,7098
1,COMMERCIAL ART AND GRAPHIC DESIGN,Arts,53864,882,40492
2,HOSPITALITY MANAGEMENT,Business,24417,437,18368
3,COSMETOLOGY SERVICES AND CULINARY ARTS,Industrial Arts & Consumer Services,5411,72,3590
4,COMMUNICATION TECHNOLOGIES,Computers & Mathematics,9109,171,7512


### Solution

```python
grad_row_sample = grad_students.iloc[:, 1:6]
grad_row_sample.head()
```



<br/><br/><br/>
## Sorting & Grouping

### Sorting

To sort a dataframe by a column, you can use the sort_values method and specify the column with the by argument. To sort the grad student dataframe by sample size in ascending order:
```python
grad_students.sort_values(by='Grad_sample_size')
```
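
The following sketch illustrates sorting by a column's values on a made-up dataframe (`demo` and its contents are assumptions for illustration). In current pandas versions this is done with sort_values, which returns a new sorted dataframe and leaves the original untouched; passing ascending=False reverses the order.

```python
import pandas as pd

# Hypothetical data, loosely modelled on the columns used in this section
demo = pd.DataFrame({
    'Major': ['A', 'B', 'C'],
    'Grad_sample_size': [200, 72, 437],
})

# Smallest sample size first (the default)
ascending = demo.sort_values(by='Grad_sample_size')
print(ascending)

# Largest sample size first
descending = demo.sort_values(by='Grad_sample_size', ascending=False)
print(descending)
```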

### Grouping

<img src="../../../images/groupbyPython-1.png">
You can use the groupby method to group rows by the values of a column.
```python
grad_students.groupby('Major_category')
```
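
On its own, groupby only returns a grouping object; results appear once an aggregation is applied to it. The sketch below shows this on a small made-up dataframe (`demo` and its values are assumptions for illustration).

```python
import pandas as pd

# Hypothetical data with two categories
demo = pd.DataFrame({
    'Major_category': ['Arts', 'Arts', 'Business', 'Business'],
    'Grad_employed': [40492, 19059, 18368, 55106],
})

grouped = demo.groupby('Major_category')
print(type(grouped).__name__)  # the grouping object, not a result table

# Aggregations are computed once per group
counts = grouped.size()                   # number of rows in each group
totals = grouped['Grad_employed'].sum()   # sum of one column per group
print(counts)
print(totals)
```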
<br/>

## Exercise:

You can use the max() method to determine the maximum of any column. For example, to determine the maximum number of employed graduates:
```python
grad_students.Grad_employed.max()
```
Can you determine the maximum number of employed graduates in each major category?

 - Assign the result to the variable grad_cat_max and print it out.

In [8]:
# Modify the code below, Use groupby before using the max function.
grad_cat_max = grad_students.groupby('Major_category').Grad_employed.max()
grad_cat_max



Major_category
Agriculture & Natural Resources         47755
Arts                                   150394
Biology & Life Science                 898342
Business                               622357
Communications & Journalism            212156
Computers & Mathematics                287467
Education                              698049
Engineering                            371723
Health                                 437115
Humanities & Liberal Arts              598806
Industrial Arts & Consumer Services    103790
Interdisciplinary                       12708
Law & Public Policy                    154146
Physical Sciences                      336838
Psychology & Social Work               915341
Social Science                         548199
Name: Grad_employed, dtype: int64

### Solution

```python
grad_cat_max = grad_students.groupby('Major_category').Grad_employed.max()
grad_cat_max
```

## Reshaping

### Pivot 

The pivot method reshapes a dataframe into a new table, using the values of one column as the new row index and the values of another column as the new column labels. It is useful for summarizing a dataset because it lets us reorganize the existing data around the columns we care about. Note that pivot only rearranges values; aggregations such as mean or count are provided by pivot_table, described below.

The pivot method takes index, columns, and values as arguments. It cannot handle duplicates: if two rows share the same index/columns combination, pivot raises a ValueError.
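
As a hedged illustration of pivot and of its behaviour with duplicates, the sketch below uses a made-up long-format dataframe (`long_df`, its column names, and its values are assumptions for illustration).

```python
import pandas as pd

# Hypothetical long-format data: one row per (city, year) pair
long_df = pd.DataFrame({
    'city': ['Oslo', 'Oslo', 'Rome', 'Rome'],
    'year': [2020, 2021, 2020, 2021],
    'sales': [10, 12, 7, 9],
})

# Cities become the rows, years become the columns
wide = long_df.pivot(index='city', columns='year', values='sales')
print(wide)

# Append a second row for (Oslo, 2020): the reshape is now ambiguous
dup_df = pd.concat([long_df, long_df.iloc[[0]]], ignore_index=True)
try:
    dup_df.pivot(index='city', columns='year', values='sales')
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print('pivot failed:', err)
```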


### Pivot Table

The pivot_table method overcomes the limitation of pivot with duplicates. It plays the same role as pivot, but when several rows share the same index/columns combination it combines their values with an aggregation function (the mean by default).



```python 
import pandas as pd
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'], 
        'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'], 
        'TestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'TestScore'])
df

Output:

    regiment    company TestScore
0   Nighthawks  1st     4
1   Nighthawks  1st     24
2   Nighthawks  2nd     31
3   Nighthawks  2nd     2
4   Dragoons    1st     3
5   Dragoons    1st     4
6   Dragoons    2nd     24
7   Dragoons    2nd     31
8   Scouts      1st     2
9   Scouts      1st     3
10  Scouts      2nd     2
11  Scouts      2nd     3

```

A pivot table of group means, by regiment and company, can be created with:

`pd.pivot_table(df, index=['regiment','company'], aggfunc='mean')`

The pivot_table method is a powerful tool for analysis, but using it well requires first understanding the data properly.
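
The sketch below rebuilds the regiment data from the example above so that it runs on its own, and shows two pivot tables: group means by regiment and company, and a variant with companies as columns. The variable names `means` and `wide` are assumptions for illustration.

```python
import pandas as pd

# Same values as the regiment example above, built more compactly
df = pd.DataFrame({
    'regiment': ['Nighthawks'] * 4 + ['Dragoons'] * 4 + ['Scouts'] * 4,
    'company': ['1st', '1st', '2nd', '2nd'] * 3,
    'TestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
})

# Mean test score for every (regiment, company) pair
means = pd.pivot_table(df, index=['regiment', 'company'],
                       values='TestScore', aggfunc='mean')
print(means)

# Regiments as rows, companies as columns, maximum score in each cell
wide = pd.pivot_table(df, index='regiment', columns='company',
                      values='TestScore', aggfunc='max')
print(wide)
```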


### Stack/Unstack

Pivoting a table is a special case of stacking a DataFrame. Suppose we have a DataFrame with a MultiIndex on both the rows and the columns. Stacking the DataFrame means moving (also called rotating or pivoting) the innermost column index level to become the innermost row index level. The inverse operation, unstacking, moves the innermost row index level to become the innermost column index level.
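
A minimal sketch of stack and unstack on a small made-up table follows (the name `wide` and its values are assumptions for illustration). For brevity it uses single-level indexes, so the only column level is the one that moves.

```python
import pandas as pd

# Hypothetical table: regiments as rows, companies as columns
wide = pd.DataFrame(
    {'1st': [3.5, 14.0], '2nd': [27.5, 16.5]},
    index=pd.Index(['Dragoons', 'Nighthawks'], name='regiment'),
)
wide.columns.name = 'company'

# stack: the column level 'company' becomes the innermost row level
stacked = wide.stack()
print(stacked)

# unstack: the innermost row level moves back to the columns
restored = stacked.unstack()
print(restored)
```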

## Exercise

Print out a pivot table on grad_students for the group max by Grad_unemployed and Grad_employed, and store it in a variable mean_pivot.

In [16]:
#write your code below:
pivot_table = pd.pivot_table(grad_students, index=['Grad_unemployed','Grad_employed'])
pivot_table

Grad_unemployed,Grad_employed,Grad_P25,Grad_P75,Grad_full_time_year_round,Grad_median,Grad_premium,Grad_sample_size,Grad_share,Grad_total,Grad_unemployment_rate,Major_code,Nongrad_P25,Nongrad_P75,Nongrad_employed,Nongrad_full_time_year_round,Nongrad_median,Nongrad_total,Nongrad_unemployed,Nongrad_unemployment_rate,emp_percent
0,1008,55000,120000.0,860,75000.0,0.500000,22,0.147376,1542,0.000000,3201,34000,75000.0,6967,6063,50000.0,8921,518,0.069205,65.369650
28,3129,55000,106000.0,2565,80000.0,0.250000,65,0.261032,3910,0.008869,5102,45000,80000.0,8699,6701,64000.0,11069,605,0.065026,80.025575
34,2284,50000,91000.0,1641,65000.0,0.000000,61,0.348230,3335,0.014668,1106,41000,89000.0,4654,3917,65000.0,6242,264,0.053680,68.485757
44,1100,40000,76000.0,770,55000.0,0.195652,27,0.197448,1733,0.038462,6099,30000,60000.0,5220,3556,46000.0,7044,1001,0.160907,63.473745
72,1248,37900,123000.0,1040,74000.0,0.156250,29,0.447501,3465,0.054545,3801,39000,90000.0,1650,1671,64000.0,4278,187,0.101796,36.017316
79,2673,75000,140800.0,1905,105000.0,0.206897,66,0.398745,3940,0.028706,2411,57000,125000.0,3797,3165,87000.0,5941,0,0.000000,67.842640
100,8567,85000,148000.0,8018,110000.0,0.100000,243,0.547688,10520,0.011538,2418,65000,130000.0,6730,6258,100000.0,8688,144,0.020949,81.435361
112,5480,60000,150000.0,4548,96000.0,0.129412,136,0.618866,6331,0.020029,5001,56000,109000.0,2720,2163,85000.0,3899,298,0.098741,86.558206
112,5640,80000,200000.0,4869,124000.0,-0.015873,164,0.288075,7479,0.019471,2419,75000,215000.0,13108,11334,126000.0,18483,580,0.042373,75.411151
119,4716,56000,114000.0,3981,85000.0,0.416667,98,0.165394,5611,0.024612,2101,40000,85000.0,22024,18381,60000.0,28314,2222,0.091644,84.049189


In [17]:
pivot_table = pd.pivot_table(grad_students, index=['Major_category','Major'], aggfunc='max')
pivot_table

Major_category,Major,Grad_P25,Grad_P75,Grad_employed,Grad_full_time_year_round,Grad_median,Grad_premium,Grad_sample_size,Grad_share,Grad_total,Grad_unemployed,...,Major_code,Nongrad_P25,Nongrad_P75,Nongrad_employed,Nongrad_full_time_year_round,Nongrad_median,Nongrad_total,Nongrad_unemployed,Nongrad_unemployment_rate,emp_percent
Agriculture & Natural Resources,AGRICULTURAL ECONOMICS,53000,120000.0,10592,8768,80000.0,0.269841,305,0.309306,14800,216,...,1102,40000,99000.0,25557,22496,63000.0,33049,734,0.027918,71.567568
Agriculture & Natural Resources,AGRICULTURE PRODUCTION AND MANAGEMENT,41600,100000.0,13104,11207,67000.0,0.218182,386,0.163965,17488,473,...,1101,38000,80000.0,71781,61335,55000.0,89169,1869,0.025377,74.931382
Agriculture & Natural Resources,ANIMAL SCIENCES,48000,104000.0,47755,39047,70300.0,0.464583,1335,0.374427,56807,596,...,1103,32000,75000.0,74896,61629,48000.0,94910,3101,0.039758,84.065344
Agriculture & Natural Resources,FOOD SCIENCE,50000,110000.0,10857,8074,72000.0,0.142857,266,0.388532,14521,370,...,1104,40000,92000.0,16298,12431,63000.0,22853,681,0.040108,74.767578
Agriculture & Natural Resources,FORESTRY,52000,110000.0,16831,14102,78000.0,0.322034,487,0.267567,24713,725,...,1302,42000,80000.0,46815,39048,59000.0,67649,1885,0.038706,68.105855
Agriculture & Natural Resources,GENERAL AGRICULTURE,45000,104000.0,28930,23024,68000.0,0.360000,764,0.263272,44306,874,...,1100,34000,80000.0,86631,72409,50000.0,123984,2352,0.026432,65.295897
Agriculture & Natural Resources,MISCELLANEOUS AGRICULTURE,45000,81000.0,2758,2276,54000.0,-0.018182,98,0.383420,5032,261,...,1199,39000,78000.0,5978,4707,55000.0,8092,239,0.038443,54.809221
Agriculture & Natural Resources,NATURAL RESOURCES MANAGEMENT,50000,100000.0,23394,19087,70000.0,0.320755,659,0.275761,29357,711,...,1303,38000,75000.0,60690,48256,53000.0,77101,3413,0.053242,79.687979
Agriculture & Natural Resources,PLANT SCIENCE AND AGRONOMY,45000,100000.0,22782,18312,67000.0,0.340000,624,0.289093,30983,735,...,1105,35000,75000.0,60241,49506,50000.0,76190,1899,0.030560,73.530646
Agriculture & Natural Resources,SOIL SCIENCE,50000,91000.0,2284,1641,65000.0,0.000000,61,0.348230,3335,34,...,1106,41000,89000.0,4654,3917,65000.0,6242,264,0.053680,68.485757


In [21]:
pivot_table = pd.pivot_table(grad_students, index=['Major_category','Major'],values=['Grad_employed'])
pivot_table

Major_category,Major,Grad_employed
Agriculture & Natural Resources,AGRICULTURAL ECONOMICS,10592
Agriculture & Natural Resources,AGRICULTURE PRODUCTION AND MANAGEMENT,13104
Agriculture & Natural Resources,ANIMAL SCIENCES,47755
Agriculture & Natural Resources,FOOD SCIENCE,10857
Agriculture & Natural Resources,FORESTRY,16831
Agriculture & Natural Resources,GENERAL AGRICULTURE,28930
Agriculture & Natural Resources,MISCELLANEOUS AGRICULTURE,2758
Agriculture & Natural Resources,NATURAL RESOURCES MANAGEMENT,23394
Agriculture & Natural Resources,PLANT SCIENCE AND AGRONOMY,22782
Agriculture & Natural Resources,SOIL SCIENCE,2284


In [51]:
# if 0 <= grad_total <= 10000:       Grad_total_range = '0 to 10000'
# if 10001 <= grad_total <= 50000:   Grad_total_range = '10000 to 50000'
# if 50001 <= grad_total <= 100000:  Grad_total_range = '50001 to 100000'
# otherwise:                         Grad_total_range = 'Above 100000'

def grad_range(value):
    """Return the range label for a Grad_total value."""
    if 0 <= value <= 10000:
        return '0 to 10000'
    elif 10001 <= value <= 50000:
        return '10000 to 50000'
    elif 50001 <= value <= 100000:
        return '50001 to 100000'
    else:
        return 'Above 100000'

# Label every row with the range its Grad_total falls into
grad_students['Grad_total_range'] = [grad_range(i) for i in grad_students['Grad_total']]

pivot_table = pd.pivot_table(grad_students, index=['Grad_total_range'], columns=['Major_category'], values=['Grad_employed'], aggfunc='max')
pivot_table

Unnamed: 0_level_0,Grad_employed,Grad_employed,Grad_employed,Grad_employed,Grad_employed,Grad_employed,Grad_employed,Grad_employed,Grad_employed,Grad_employed,Grad_employed,Grad_employed,Grad_employed,Grad_employed,Grad_employed,Grad_employed
Major_category,Agriculture & Natural Resources,Arts,Biology & Life Science,Business,Communications & Journalism,Computers & Mathematics,Education,Engineering,Health,Humanities & Liberal Arts,Industrial Arts & Consumer Services,Interdisciplinary,Law & Public Policy,Physical Sciences,Psychology & Social Work,Social Science
Grad_total_range,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2
0 to 10000,2758.0,1100.0,7571.0,2020.0,,7512.0,,5640.0,,,7098.0,,1008.0,7443.0,4796.0,
10000 to 50000,28930.0,21628.0,38068.0,36227.0,35939.0,22403.0,15651.0,22501.0,37639.0,31558.0,20035.0,12708.0,31985.0,8657.0,24652.0,22755.0
50001 to 100000,47755.0,50690.0,66240.0,55106.0,,60858.0,63027.0,73147.0,54355.0,65878.0,,,,66748.0,38468.0,54693.0
Above 100000,,150394.0,898342.0,622357.0,212156.0,287467.0,698049.0,371723.0,437115.0,598806.0,103790.0,,154146.0,336838.0,915341.0,548199.0
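
As an alternative to a hand-written function such as grad_range, pandas offers pd.cut for binning numeric values. This is a different technique from the one used in the cell above, shown here only as a sketch on a made-up series (`totals`, `bins`, and `labels` are assumptions for illustration). For non-negative integer totals it should produce the same labels.

```python
import pandas as pd

# Hypothetical totals, including values on the bin boundaries
totals = pd.Series([1542, 10000, 10001, 53864, 100000, 181514])

bins = [0, 10000, 50000, 100000, float('inf')]
labels = ['0 to 10000', '10000 to 50000', '50001 to 100000', 'Above 100000']

# Bins are right-inclusive by default: (0, 10000], (10000, 50000], ...
# include_lowest=True also places a total of exactly 0 in the first bin
ranges = pd.cut(totals, bins=bins, labels=labels, include_lowest=True)
print(ranges)
```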


In [54]:
grad_students[grad_students.Major_category.str.contains('Arts')]

Unnamed: 0,Major_code,Major,Major_category,Grad_total,Grad_sample_size,Grad_employed,Grad_full_time_year_round,Grad_unemployed,Grad_unemployment_rate,Grad_median,...,Nongrad_full_time_year_round,Nongrad_unemployed,Nongrad_unemployment_rate,Nongrad_median,Nongrad_P25,Nongrad_P75,Grad_share,Grad_premium,emp_percent,Grad_total_range
0,5601,CONSTRUCTION SERVICES,Industrial Arts & Consumer Services,9173,200,7098,6511,681,0.087543,75000.0,...,62435,3928,0.050661,65000.0,47000,98000.0,0.09632,0.153846,77.379265,0 to 10000
1,6004,COMMERCIAL ART AND GRAPHIC DESIGN,Arts,53864,882,40492,29553,2482,0.057756,60000.0,...,250596,25484,0.068386,48000.0,34000,71000.0,0.10442,0.25,75.174514,50001 to 100000
3,2201,COSMETOLOGY SERVICES AND CULINARY ARTS,Industrial Arts & Consumer Services,5411,72,3590,2701,316,0.080901,47000.0,...,23249,1661,0.0529,41600.0,29000,60000.0,0.125878,0.129808,66.346332,0 to 10000
10,6005,FILM VIDEO AND PHOTOGRAPHIC ARTS,Arts,24525,370,19059,13301,2035,0.096473,57000.0,...,63674,8160,0.079941,50000.0,32000,75000.0,0.174328,0.14,77.712538,10000 to 50000
11,5701,"ELECTRICAL, MECHANICAL, AND PRECISION TECHNOLO...",Industrial Arts & Consumer Services,3187,45,1984,1481,319,0.138515,62000.0,...,9949,653,0.051933,50000.0,32400,75000.0,0.176771,0.24,62.252902,0 to 10000
14,5901,TRANSPORTATION SCIENCES AND TECHNOLOGIES,Industrial Arts & Consumer Services,27410,538,20035,18088,980,0.046633,90000.0,...,80650,4326,0.043757,69000.0,45000,100000.0,0.184368,0.304348,73.093761,10000 to 50000
19,6099,MISCELLANEOUS FINE ARTS,Arts,1733,27,1100,770,44,0.038462,55000.0,...,3556,1001,0.160907,46000.0,30000,60000.0,0.197448,0.195652,63.473745,0 to 10000
36,6000,FINE ARTS,Arts,181514,2528,124246,80794,8072,0.061005,58000.0,...,247638,26518,0.067909,46000.0,30000,70000.0,0.251075,0.26087,68.449817,Above 100000
44,3401,LIBERAL ARTS,Humanities & Liberal Arts,213970,3420,154404,112953,6788,0.042111,70000.0,...,288616,27049,0.06528,50000.0,34000,75000.0,0.27097,0.4,72.161518,Above 100000
46,6003,VISUAL AND PERFORMING ARTS,Arts,18324,275,14841,9671,593,0.038422,53000.0,...,21969,3729,0.093842,40000.0,28000,60000.0,0.274016,0.325,80.992141,10000 to 50000
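
The cell above filters rows with a boolean mask built by str.contains. The sketch below shows the same idea on a small made-up series (`categories` and its values are assumptions for illustration), including the na and case parameters, which control how missing values and letter case are handled.

```python
import pandas as pd

# Hypothetical category names, with one missing value
categories = pd.Series([
    'Arts',
    'Industrial Arts & Consumer Services',
    'Business',
    'Humanities & Liberal Arts',
    None,
])

# na=False treats missing values as non-matches, so the mask is purely boolean
mask = categories.str.contains('Arts', na=False)
print(categories[mask])

# case=False makes the match case-insensitive
n_matches = categories.str.contains('arts', case=False, na=False).sum()
print(n_matches)
```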
