![](logo.png)

# <font color='red'>Introduction to Pandas Exercise Solutions</font>

The last two weeks focused on the Pandas library, specifically reading files into dataframes, indexing and selecting subsets of data, merging and joining, and writing dataframes to various file formats. This exercise reviews the material covered in the Week 4 and 5 lectures.

1. Import the Pandas, NumPy, and OS libraries

In [1]:
import pandas as pd
import numpy as np
import os

2. Read in the peanut_lines CSV file as **peanut_lines** (Use encoding='ISO-8859-1')

In [2]:
peanut_lines = pd.read_csv('peanut_lines.csv', encoding='ISO-8859-1')

3. Check the head of the file

In [3]:
peanut_lines.head()

Unnamed: 0,NC_Accession,Identity or Parentage,Pedigree,FAG
0,ACI WT09-0761,ACI WT09-0761,,Check
1,ACI WT11-0351,ACI WT11-0351,,ol ol
2,ACI WT12-0226,ACI WT12-0226,,ol ol
3,ACI WT12-0419,ACI WT12-0419,,Check
4,ACI WT12-0420,ACI WT12-0420,,ol ol


4. Use the .info() method to find out how many peanut_lines are in the dataframe

    **Bonus:** Print the total number of entries

In [4]:
peanut_lines.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 859 entries, 0 to 858
Data columns (total 4 columns):
NC_Accession             466 non-null object
Identity or Parentage    466 non-null object
Pedigree                 409 non-null object
FAG                      466 non-null object
dtypes: object(4)
memory usage: 27.0+ KB


In [7]:
print(peanut_lines.shape[0])  # shape is (rows, columns); index 0 gives the row count

859
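Note that `.info()` reports 859 entries but only 466 non-null values in `NC_Accession`, so the row count and the number of named lines are not the same thing. A minimal sketch of the distinction, using a small made-up frame (the values are illustrative and not taken from peanut_lines.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for peanut_lines: two filled rows and two empty rows.
demo = pd.DataFrame({
    'NC_Accession': ['Bailey', 'Wynne', np.nan, np.nan],
    'FAG': ['+ +', 'ol ol', np.nan, np.nan],
})

total_rows = demo.shape[0]                   # every row, including empty ones
named_rows = demo['NC_Accession'].count()    # only non-null values are counted
non_empty = demo.dropna(how='all').shape[0]  # rows with at least one value

print(total_rows, named_rows, non_empty)
```

Here `shape[0]` answers "how many entries", while `count()` answers "how many entries actually have an accession name".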


5. Read in the peanut_yield text file as **peanut_yield** (Use encoding='ISO-8859-1')

In [9]:
peanut_yield = pd.read_csv('peanut_yield.txt', sep='\t', encoding='ISO-8859-1')
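`pd.read_csv` reads tab-delimited text once `sep='\t'` is passed. The sketch below reads from an in-memory string rather than a file so that it runs on its own; the text is a made-up stand-in for peanut_yield.txt with only a subset of the real columns.

```python
import io
import pandas as pd

# Made-up tab-separated text standing in for peanut_yield.txt.
raw = (
    "Year\tLocation\tNC_Accession\tYield\n"
    "2017\tLEW\tBailey\t4841\n"
    "2017\tRMT\tBailey\t4480\n"
)

# io.StringIO lets read_csv treat the string like an open file.
demo_yield = pd.read_csv(io.StringIO(raw), sep='\t')
print(demo_yield)
```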

6. Check the column names of peanut_yield

In [10]:
peanut_yield.columns

Index(['Year', 'Location', 'Name', 'Label', 'NC_Accession', 'Plot_Yield',
       'Yield'],
      dtype='object')

7. Using 'NC_Accession', merge (or join) the two dataframes together keeping the entirety of the peanut_yield data. Name the new dataframe **peanut_data**

In [12]:
peanut_lines.sort_values('NC_Accession', inplace=True)
peanut_yield.sort_values('NC_Accession', inplace=True)
peanut_data = pd.merge(peanut_lines, peanut_yield, on='NC_Accession', how='right')
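With `how='right'`, every row of the right-hand frame is kept, and rows with no match in the left-hand frame get NaN in the left-hand columns. A small sketch with made-up frames (`lines`, `yields`, and the accession `N99999` are hypothetical), using the `indicator=True` option of `pd.merge` to show where each row came from:

```python
import pandas as pd

# Hypothetical miniature versions of the two frames.
lines = pd.DataFrame({'NC_Accession': ['Bailey', 'Wynne'],
                      'FAG': ['+ +', 'ol ol']})
yields = pd.DataFrame({'NC_Accession': ['Bailey', 'Bailey', 'N99999'],
                       'Yield': [4104, 3915, 3500]})

# Right merge: all rows of `yields` survive, matched or not.
merged = pd.merge(lines, yields, on='NC_Accession', how='right', indicator=True)
print(merged)

# 'right_only' marks yield rows that have no matching line record.
unmatched = merged[merged['_merge'] == 'right_only']
```

Checking `unmatched` after a merge is a quick way to spot accession names that are spelled differently in the two files.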

8. What is the average yield in 2017?

In [13]:
peanut_data['Yield'][peanut_data['Year'] == 2017].mean()

3981.6
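The same kind of filter can also be written with `.loc`, which selects the rows and the column in a single step. A sketch on made-up numbers, with a `groupby` variant that returns the mean for every year at once:

```python
import pandas as pd

# Illustrative data only.
demo = pd.DataFrame({'Year': [2016, 2017, 2017],
                     'Yield': [2262, 4480, 4841]})

# The boolean mask picks the 2017 rows; 'Yield' picks the column.
mean_2017 = demo.loc[demo['Year'] == 2017, 'Yield'].mean()

# One mean per year, indexed by Year.
by_year = demo.groupby('Year')['Yield'].mean()

print(mean_2017)
print(by_year)
```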

9. What was the average yield of the top 10 lines tested in the peanut program?

In [16]:
peanut_data.groupby('NC_Accession')['Yield'].mean().sort_values(ascending=False).head(10).mean()

5155.0666666666675
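The question is answered in two steps: average each line first, then average the best of those line means. A sketch on made-up data, using `Series.nlargest` as an alternative to sorting and taking the head (top 2 instead of top 10 to keep the example small):

```python
import pandas as pd

# Illustrative data: three hypothetical lines, two plots each.
demo = pd.DataFrame({
    'NC_Accession': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Yield': [4000, 5000, 3000, 3500, 5200, 5400],
})

line_means = demo.groupby('NC_Accession')['Yield'].mean()  # one mean per line
top2 = line_means.nlargest(2)                              # best two lines
top2_mean = top2.mean()                                    # mean of those means

print(top2)
print(top2_mean)
```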

10. What are the top 10 most commonly tested lines? Hint: value_counts()

In [17]:
peanut_data['NC_Accession'].value_counts().head(10)

N11028        34
Sullivan      26
Emery         26
Sugg          26
Bailey        26
Wynne         26
Florida-07    20
N11020        20
Bailey II     20
N08085        18
Name: NC_Accession, dtype: int64
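`value_counts()` returns a Series indexed by the distinct values and sorted from most to least frequent, which is why `head(10)` gives the ten most tested lines. A tiny sketch on a made-up Series:

```python
import pandas as pd

# Illustrative data only.
tested = pd.Series(['Bailey', 'Wynne', 'Bailey', 'Emery', 'Bailey', 'Wynne'],
                   name='NC_Accession')

counts = tested.value_counts()  # sorted in descending order of frequency
top2 = counts.head(2)
print(top2)
```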

In [None]:
peanut_data.head()

11. Create a dataframe that satisfies the following requirements:
<br>peanut lines = 'Bailey', 'Sullivan', 'Wynne', 'Emery', 'Bailey II', 'N14023'
<br>location = 'LEW', 'RMT' #Lewiston and Rocky Mount
<br>Dataframe = line_data

In [20]:
# Use new names so the peanut_lines dataframe is not overwritten
selected_lines = ['Bailey','Sullivan','Wynne','Emery','Bailey II','N14023']
locations = ['LEW', 'RMT']
line_data = peanut_data[peanut_data['NC_Accession'].isin(selected_lines) & peanut_data['Location'].isin(locations)]

In [21]:
line_data

Unnamed: 0,NC_Accession,Identity or Parentage,Pedigree,FAG,Year,Location,Name,Label,Plot_Yield,Yield
16,Bailey,NC 12C*2 / N96076L,BC1F1-06-01-S-03-S-05: F09,+ +,2018,RMT,ATP,Advanced Testing Program - Yield,13.6,4104
17,Bailey,NC 12C*2 / N96076L,BC1F1-06-01-S-03-S-05: F09,+ +,2018,LEW,ATP,Advanced Testing Program - Yield,12.9,3915
19,Bailey,NC 12C*2 / N96076L,BC1F1-06-01-S-03-S-05: F09,+ +,2017,RMT,ATP,Advanced Testing Program - Yield,14.8,4480
20,Bailey,NC 12C*2 / N96076L,BC1F1-06-01-S-03-S-05: F09,+ +,2017,LEW,ATP,Advanced Testing Program - Yield,16.0,4841
22,Bailey,NC 12C*2 / N96076L,BC1F1-06-01-S-03-S-05: F09,+ +,2016,RMT,ATP,Advanced Testing Program - Yield,7.5,2262
...,...,...,...,...,...,...,...,...,...,...
2651,Wynne,Bailey*2 / Brantley,BC1F1-04-01-S-02-S-02: F09,ol ol,2012,LEW,ATP,Advanced Testing Program - Yield,12.7,3849
2653,Wynne,Bailey*2 / Brantley,BC1F1-04-01-S-02-S-02: F09,ol ol,2011,RMT,ATP,Advanced Testing Program - Yield,17.3,5235
2654,Wynne,Bailey*2 / Brantley,BC1F1-04-01-S-02-S-02: F09,ol ol,2011,LEW,ATP,Advanced Testing Program - Yield,14.5,4394
2656,Wynne,Bailey*2 / Brantley,BC1F1-04-01-S-02-S-02: F09,ol ol,2010,RMT,ATP,Advanced Testing Program - Yield,6.5,1966
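The filter for this step combines two `isin()` masks with `&`, which keeps only rows where both conditions hold. A sketch on a made-up frame (the location code `WHI` is hypothetical and only serves as a row that should be dropped):

```python
import pandas as pd

# Illustrative data only.
demo = pd.DataFrame({
    'NC_Accession': ['Bailey', 'Bailey', 'Wynne', 'N11028'],
    'Location': ['LEW', 'WHI', 'RMT', 'LEW'],
    'Yield': [3915, 4200, 5235, 4000],
})

wanted_lines = ['Bailey', 'Wynne']
wanted_locations = ['LEW', 'RMT']

# Each isin() call returns a boolean Series; & keeps rows where both are True.
mask = demo['NC_Accession'].isin(wanted_lines) & demo['Location'].isin(wanted_locations)
subset = demo[mask]
print(subset)
```

When a condition uses a comparison operator such as `==` instead of `isin()`, wrap it in parentheses, because `&` binds more tightly than the comparison.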


12. Export the means of each line by location to an Excel file called 'NCSU_release.xlsx'

In [22]:
line_data.groupby(['Location','NC_Accession'])['Yield'].mean().to_excel('NCSU_release.xlsx')
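The grouped result is a Series with a two-level index (`Location`, `NC_Accession`). One possible spreadsheet layout, shown here as a sketch on made-up numbers, pivots the locations into columns with `unstack` before exporting. Writing `.xlsx` files with `to_excel` needs an Excel engine such as openpyxl to be installed, so the export call is left commented out.

```python
import pandas as pd

# Illustrative data only.
demo = pd.DataFrame({
    'Location': ['LEW', 'LEW', 'RMT', 'RMT'],
    'NC_Accession': ['Bailey', 'Wynne', 'Bailey', 'Wynne'],
    'Yield': [3915, 3849, 4104, 5235],
})

# Select 'Yield' before calling mean() so text columns are never averaged.
means = demo.groupby(['Location', 'NC_Accession'])['Yield'].mean()

# Move the Location level to the columns: one row per line, one column per site.
table = means.unstack('Location')
print(table)

# table.to_excel('NCSU_release.xlsx')  # requires an Excel engine such as openpyxl
```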