1. Use pandas to create a Series from the following data:

["kiwi", "mango", "strawberry", "pineapple", "gala apple", "honeycrisp apple", "tomato", "watermelon", "honeydew", "kiwi", "kiwi", "kiwi", "mango", "blueberry", "blackberry", "gooseberry", "papaya"]

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [6]:
# a. Name the variable that holds the series fruits.

fruits = pd.Series(["kiwi", "mango", "strawberry", "pineapple", "gala apple", "honeycrisp apple", "tomato", "watermelon", "honeydew", "kiwi", "kiwi", "kiwi", "mango", "blueberry", "blackberry", "gooseberry", "papaya"])
fruits

0                 kiwi
1                mango
2           strawberry
3            pineapple
4           gala apple
5     honeycrisp apple
6               tomato
7           watermelon
8             honeydew
9                 kiwi
10                kiwi
11                kiwi
12               mango
13           blueberry
14          blackberry
15          gooseberry
16              papaya
dtype: object

In [7]:
fruits_hasnan = pd.Series([None, "kiwi", "mango", "strawberry", "pineapple", "gala apple", "honeycrisp apple", "tomato", "watermelon", "honeydew", "kiwi", "kiwi", "kiwi", "mango", "blueberry", "blackberry", "gooseberry", "papaya"])
fruits_hasnan

0                 None
1                 kiwi
2                mango
3           strawberry
4            pineapple
5           gala apple
6     honeycrisp apple
7               tomato
8           watermelon
9             honeydew
10                kiwi
11                kiwi
12                kiwi
13               mango
14           blueberry
15          blackberry
16          gooseberry
17              papaya
dtype: object

In [8]:
# What is the data type of fruits?

fruits.dtype

# The elements have the object dtype ('O'), which pandas uses here to store Python strings

dtype('O')

In [9]:
# type() reports the type of the container itself: a pandas Series

type(fruits)

pandas.core.series.Series

### b. Run .describe() on the series to see what describe returns for a series of strings.
`Series.describe()` on a Series of strings (object dtype) returns:
    - count: the number of non-missing values
    - unique: the number of distinct values
    - top: the most frequent value
    - freq: how many times the most frequent value occurs

In [10]:
fruits.describe()

count       17
unique      13
top       kiwi
freq         4
dtype: object

In [11]:
fruits_hasnan.describe()

count       17
unique      13
top       kiwi
freq         4
dtype: object
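The two `describe()` outputs match because missing values are left out of `count`. A minimal sketch of that behavior, using a small made-up Series rather than the exercise data:

```python
import pandas as pd

# Hypothetical three-element Series with one missing value.
s = pd.Series([None, "kiwi", "kiwi"])

summary = s.describe()
print(len(s))            # total elements, including the missing one
print(summary["count"])  # non-missing elements only
print(s.isna().sum())    # number of missing values
```

Comparing `len(s)` with `count` is a quick way to spot missing values that `describe()` does not report directly.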

### c. Run the code necessary to produce only the unique fruit names.

### `Series.unique()`
    - returns the unique values of the Series
    - unique values are returned in the order in which they first appear in the original Series
    - the return type is `numpy.ndarray`

In [12]:
unique_fruit = fruits.unique()
print(unique_fruit)
print(type(unique_fruit))
print(unique_fruit.size)

['kiwi' 'mango' 'strawberry' 'pineapple' 'gala apple' 'honeycrisp apple'
 'tomato' 'watermelon' 'honeydew' 'blueberry' 'blackberry' 'gooseberry'
 'papaya']
<class 'numpy.ndarray'>
13


### `Series.nunique()` 

Returns the number of unique values (missing values are excluded by default).

In [13]:
fruits.nunique()

13
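As a small sketch of how `nunique()` treats missing values, the example below uses a made-up Series and the `dropna` parameter:

```python
import pandas as pd

# Hypothetical Series with one missing value.
s = pd.Series(["kiwi", None, "kiwi", "mango"])

n_default = s.nunique()               # missing values are ignored
n_with_nan = s.nunique(dropna=False)  # the missing value counts as one more distinct value
print(n_default, n_with_nan)
```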

### d. Determine how many times each value occurs in the series.

`Series.value_counts()`
- similar to `GROUP BY` with `COUNT(*)` in SQL
- returns the count of each unique value, sorted from most to least frequent
- the index of the result holds the unique values (very useful)

In [14]:
fruit_count = fruits.value_counts()
    
# A new Series is returned: the values are the counts (int) and the index labels are the fruit names.

print(fruit_count)
print(type(fruit_count)) 
print(fruit_count.index)
type(fruit_count.index)

kiwi                4
mango               2
watermelon          1
gala apple          1
tomato              1
pineapple           1
honeycrisp apple    1
gooseberry          1
papaya              1
blackberry          1
honeydew            1
blueberry           1
strawberry          1
dtype: int64
<class 'pandas.core.series.Series'>
Index(['kiwi', 'mango', 'watermelon', 'gala apple', 'tomato', 'pineapple',
       'honeycrisp apple', 'gooseberry', 'papaya', 'blackberry', 'honeydew',
       'blueberry', 'strawberry'],
      dtype='object')


pandas.core.indexes.base.Index
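To make the SQL comparison concrete, here is a minimal sketch on a made-up Series. It compares `value_counts()` with an explicit group-and-count, and also shows the `normalize` option for relative frequencies:

```python
import pandas as pd

# Hypothetical sample data, not the exercise Series.
s = pd.Series(["kiwi", "mango", "kiwi", "papaya", "kiwi"])

counts = s.value_counts()                # sorted by count, descending
grouped = s.groupby(s).size()            # roughly: SELECT value, COUNT(*) ... GROUP BY value
shares = s.value_counts(normalize=True)  # relative frequencies instead of counts

print(counts)
print(grouped)
print(shares)
```

The two counting approaches hold the same numbers; they differ only in ordering (`value_counts()` sorts by count, `groupby` sorts by label).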

### e. Determine the most frequently occurring fruit name from the series.

`Series.nlargest(n=5, keep='first')`
- Returns the n largest elements
- Parameters:
    - n: int, default 5; the number of values to return, sorted in descending order
    - keep: how to handle ties
        - 'first': keep the first occurrence among the tied values
        - 'last': keep the last occurrence among the tied values
        - 'all': keep every tied value, even if that returns more than n elements
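The effect of `keep` only shows up when values tie. A minimal sketch with a made-up count Series in which two fruits share the maximum:

```python
import pandas as pd

# Hypothetical counts with a tie for first place.
counts = pd.Series({"kiwi": 4, "mango": 4, "papaya": 1})

first = counts.nlargest(n=1)               # keep="first" is the default
last = counts.nlargest(n=1, keep="last")   # last occurrence among the tied values
tied = counts.nlargest(n=1, keep="all")    # may return more than n rows

print(first.index.tolist())
print(last.index.tolist())
print(tied.index.tolist())
```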

In [23]:
# Review: return the most frequent element including duplicates

fruits.value_counts().nlargest(n=1, keep = "all")

kiwi    4
dtype: int64

In [78]:
# Subsetting through mask

max_count = fruit_count.max()

fruit_count == max_count

fruit_count[fruit_count == max_count]

kiwi    4
dtype: int64

In [25]:
# Review: value_counts() sorts in descending order by default, so the first row is the most frequent element
# (if several values tie for the maximum, head(1) returns only one of them)

fruits.value_counts().head(1)

kiwi    4
dtype: int64

### `idxmax()`

- Return the row label of the maximum value
- If multiple values equal the maximum, the first row label with that value is returned
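A short sketch of the tie behavior described above, assuming a made-up count Series. `idxmax()` reports a single label, so a boolean mask is used alongside it to recover every label at the maximum:

```python
import pandas as pd

# Hypothetical counts with a tie for the maximum.
counts = pd.Series({"kiwi": 4, "mango": 4, "papaya": 1})

label = counts.idxmax()                   # first label holding the maximum
all_max = counts[counts == counts.max()]  # every label holding the maximum

print(label)
print(all_max.index.tolist())
```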

In [26]:
# Review: since there is only one maximum, idxmax works

fruits.value_counts().nlargest(n=1, keep="all").idxmax()

'kiwi'

### f. Determine the least frequently occurring fruit name from the series.

### `Series.nsmallest()`
- Similar to Series.nlargest()

In [27]:
# Review: return every value whose count equals the minimum
    
fruits.value_counts().nsmallest(n=1, keep="all")

watermelon          1
gala apple          1
tomato              1
pineapple           1
honeycrisp apple    1
gooseberry          1
papaya              1
blackberry          1
honeydew            1
blueberry           1
strawberry          1
dtype: int64

In [32]:
# Review: return only the names of the fruits

list(fruits.value_counts().nsmallest(n=1, keep="all").index)

['watermelon',
 'gala apple',
 'tomato',
 'pineapple',
 'honeycrisp apple',
 'gooseberry',
 'papaya',
 'blackberry',
 'honeydew',
 'blueberry',
 'strawberry']

In [35]:
# Same as above
fruits.value_counts().nsmallest(n=1, keep="all").index.to_list()

['watermelon',
 'gala apple',
 'tomato',
 'pineapple',
 'honeycrisp apple',
 'gooseberry',
 'papaya',
 'blackberry',
 'honeydew',
 'blueberry',
 'strawberry']

In [28]:
# Subsetting

min_count = fruit_count.min()
fruit_count[fruit_count == min_count]

watermelon          1
gala apple          1
tomato              1
pineapple           1
honeycrisp apple    1
gooseberry          1
papaya              1
blackberry          1
honeydew            1
blueberry           1
strawberry          1
dtype: int64

In [29]:
# Review: value_counts() sorts in descending order, so the 11 fruits that occur once are the last 11 rows

fruits.value_counts().tail(11)

watermelon          1
gala apple          1
tomato              1
pineapple           1
honeycrisp apple    1
gooseberry          1
papaya              1
blackberry          1
honeydew            1
blueberry           1
strawberry          1
dtype: int64

### g. Write the code to get the longest string from the fruits series.

In [37]:
# Subsetting through mask

max_len = fruits.str.len().max()
fruits[fruits.str.len() == max_len]

5    honeycrisp apple
dtype: object

### `Series.to_list()`

In [42]:
# Review: using a key with max. The Series needs to be converted to list first. 

# Convert the series to list

list_fruits = fruits.to_list()
print(list_fruits)


# use max with key = len

max(list_fruits, key = len)

['kiwi', 'mango', 'strawberry', 'pineapple', 'gala apple', 'honeycrisp apple', 'tomato', 'watermelon', 'honeydew', 'kiwi', 'kiwi', 'kiwi', 'mango', 'blueberry', 'blackberry', 'gooseberry', 'papaya']


'honeycrisp apple'

### h. Find the fruit(s) with 5 or more letters in the name.

In [43]:
# remove the space

fruits.str.replace(" ",'')

# Find out the length with 5 or more

fruits.str.replace(" ",'').str.len() >= 5

# subset

fruits[fruits.str.replace(" ",'').str.len() >= 5]

1                mango
2           strawberry
3            pineapple
4           gala apple
5     honeycrisp apple
6               tomato
7           watermelon
8             honeydew
12               mango
13           blueberry
14          blackberry
15          gooseberry
16              papaya
dtype: object

### i. Capitalize all the fruit strings in the series.

In [44]:


cap_fruit = fruits.str.capitalize()
cap_fruit

0                 Kiwi
1                Mango
2           Strawberry
3            Pineapple
4           Gala apple
5     Honeycrisp apple
6               Tomato
7           Watermelon
8             Honeydew
9                 Kiwi
10                Kiwi
11                Kiwi
12               Mango
13           Blueberry
14          Blackberry
15          Gooseberry
16              Papaya
dtype: object

### j. Count the letter "a" in all the fruits (use string vectorization)

In [46]:
count_a = fruits.str.count("a")
count_a

0     0
1     1
2     1
3     1
4     3
5     1
6     1
7     1
8     0
9     0
10    0
11    0
12    1
13    0
14    1
15    0
16    3
dtype: int64

### k. Output the number of vowels in each and every fruit.

In [84]:
# store the result so the next cell can reuse it
vowel_counts = fruits.str.lower().str.count(r'[aeiou]')
vowel_counts

0     2
1     2
2     2
3     4
4     4
5     5
6     3
7     4
8     3
9     2
10    2
11    2
12    2
13    3
14    2
15    4
16    3
dtype: int64

In [85]:
fruity = pd.DataFrame({'fruits': fruits, 'vowel_count': vowel_counts})
fruity

             fruits  vowel_count
0              kiwi            2
1             mango            2
2        strawberry            2
3         pineapple            4
4        gala apple            4
5  honeycrisp apple            5
6            tomato            3
7        watermelon            4
8          honeydew            3
9              kiwi            2


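### l. Use the .apply method and a lambda function to find the fruit(s) containing two or more "o" letters in the name.

In [47]:
# Keep the fruit name when it has two or more "o" letters; otherwise return an empty string

fruits.apply(lambda fruit: fruit if fruit.count('o') >= 2 else "")

# Non-matching rows show up as empty strings in the output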

0               
1               
2               
3               
4               
5               
6         tomato
7               
8               
9               
10              
11              
12              
13              
14              
15    gooseberry
16              
dtype: object

In [52]:
# Review

# Create a boolean series of True if the value meets the condition of having two or more 'o' letters
# and False otherwise

mask = fruits.apply(lambda fruit: fruit.count('o') >= 2)
mask

0     False
1     False
2     False
3     False
4     False
5     False
6      True
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15     True
16    False
dtype: bool

In [53]:
# Subset by mask

fruits[mask]

6         tomato
15    gooseberry
dtype: object
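The same mask can be built without `.apply`, using the string accessor. A minimal sketch on a shortened, made-up fruit list, comparing the two approaches:

```python
import pandas as pd

# Hypothetical shortened fruit list.
fruits = pd.Series(["kiwi", "tomato", "gooseberry", "mango"])

mask_apply = fruits.apply(lambda fruit: fruit.count("o") >= 2)  # Python-level call per element
mask_str = fruits.str.count("o") >= 2                           # string accessor, no lambda

print(fruits[mask_str])
```

Both masks select the same rows. The `.str` version also tolerates missing values, where the lambda would raise an error on `None`.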

### m. Write the code to get only the fruits containing "berry" in the name

### `Series.str.contains()`

In [55]:
fruits.str.contains('berry')

0     False
1     False
2      True
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13     True
14     True
15     True
16    False
dtype: bool

In [56]:
fruits[fruits.str.contains('berry')]

2     strawberry
13     blueberry
14    blackberry
15    gooseberry
dtype: object

### n. Write the code to get only the fruits containing "apple" in the name

In [58]:
fruits.str.contains('apple')

0     False
1     False
2     False
3      True
4      True
5      True
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
dtype: bool

In [59]:
fruits[fruits.str.contains('apple')]

3           pineapple
4          gala apple
5    honeycrisp apple
dtype: object

### o. Which fruit has the highest amount of vowels?

In [88]:
# define a function that counts vowels

def count_vowel(word):
    word = word.lower()
    count = 0
    for letter in word:
        if letter in "aeiou":
            count += 1
    return count

# find out the highest amount of vowels

max_vowels = fruits.apply(count_vowel).max()

# create the mask

mask = fruits.apply(count_vowel) == max_vowels

# subset

fruits[mask]

5    honeycrisp apple
dtype: object

In [71]:
# Review:

# Get the count of vowels
vowel_counts = fruits.str.count(r'[aeiou]')
vowel_counts

0     2
1     2
2     2
3     4
4     4
5     5
6     3
7     4
8     3
9     2
10    2
11    2
12    2
13    3
14    2
15    4
16    3
dtype: int64

In [77]:
# Get the element(s) through mask and subset - Method 1
# Convert the vowel counts into boolean series

mask = fruits.str.count(r'[aeiou]') == vowel_counts.max()
fruits[mask]

5    honeycrisp apple
dtype: object

In [73]:
# Get the element(s) through index labels - Method 2

vowel_counts.nlargest(n=1,keep='all').index.to_list()

# the index holds labels, so select with .loc rather than the positional .iloc
fruits.loc[vowel_counts.nlargest(n=1,keep='all').index.to_list()]

5    honeycrisp apple
dtype: object

In [75]:
# Get the element through its index label - Method 3

vowel_counts.idxmax()

# idxmax() returns a label, so select with .loc
fruits.loc[vowel_counts.idxmax()]

'honeycrisp apple'
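With the default 0..n-1 index, labels and positions coincide, so label-based and position-based lookups give the same result. A sketch with a made-up, string-labeled Series shows where they differ:

```python
import pandas as pd

# Hypothetical Series with string labels instead of the default integer index.
vowels = pd.Series([2, 5, 3], index=["kiwi", "honeycrisp apple", "tomato"])

label = vowels.idxmax()     # a label
position = vowels.argmax()  # an integer position

by_label = vowels.loc[label]          # .loc looks up by label
by_position = vowels.iloc[position]   # .iloc looks up by position

print(label, position, by_label, by_position)
```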

2. Use pandas to create a Series from the following data:

['$796,459.41', '$278.60', '$482,571.67', '$4,503,915.98', '$2,121,418.3', '$1,260,813.3', '$87,231.01', '$1,509,175.45', '$4,138,548.00', '$2,848,913.80', '$594,715.39', '$4,789,988.17', '$4,513,644.5', '$3,191,059.97', '$1,758,712.24', '$4,338,283.54', '$4,738,303.38', '$2,791,759.67', '$769,681.94', '$452,650.23']

In [None]:
data = pd.Series(['$796,459.41', '$278.60', '$482,571.67', '$4,503,915.98', '$2,121,418.3', '$1,260,813.3', '$87,231.01', '$1,509,175.45', '$4,138,548.00', '$2,848,913.80', '$594,715.39', '$4,789,988.17', '$4,513,644.5', '$3,191,059.97', '$1,758,712.24', '$4,338,283.54', '$4,738,303.38', '$2,791,759.67', '$769,681.94', '$452,650.23'])
data

In [None]:
# What is the data type of the series?

print(type(data))
type(data[0])

In [None]:
data.dtype

In [None]:
# Use series operations to convert the series to a numeric data type.

# remove the $ sign (regex=False treats "$" as a literal character, not the end-of-string anchor)

data_num = data.str.replace("$", "", regex=False)

# remove the ","

data_num = data_num.str.replace(",", "", regex=False)

# convert to float

data_num = data_num.astype(float)
data_num

In [None]:
# Review: the same conversion as one method chain

data.str.replace("$", "", regex=False).str.replace(",", "", regex=False).astype('float')
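Both symbols can also be removed in a single pass with a regular-expression character class. A minimal sketch on the first few values of the exercise data:

```python
import pandas as pd

amounts = pd.Series(["$796,459.41", "$278.60", "$2,121,418.3"])

# regex=True makes "[$,]" a character class matching either symbol;
# inside the brackets "$" is a literal character.
cleaned = amounts.str.replace(r"[$,]", "", regex=True).astype(float)

print(cleaned)
```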

In [None]:
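# What is the maximum value? The minimum?

# avoid naming variables max or min, which would shadow the Python built-ins
max_value = data_num.max()
print(max_value)

min_value = data_num.min()
print(min_value)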

In [None]:
data_num.agg(['max','min'])

In [None]:
# Bin the data into 4 equally sized intervals and show how many values fall into each bin.

data_bin_4 = pd.cut(data_num, 4)

print(data_bin_4)

data_bin_4.value_counts()
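Passing an integer to `pd.cut` produces bins of equal width, not bins with equal numbers of values. The sketch below contrasts it with `pd.qcut` on a made-up, skewed Series:

```python
import pandas as pd

# Hypothetical skewed data: seven small values and one large one.
values = pd.Series([0, 1, 2, 3, 4, 5, 6, 100])

# pd.cut: 4 bins of equal width spanning the range of the data
width_counts = pd.cut(values, 4).value_counts().sort_index()

# pd.qcut: 4 bins chosen so each holds (roughly) the same number of values
quantile_counts = pd.qcut(values, 4).value_counts().sort_index()

print(width_counts)
print(quantile_counts)
```

Sorting by index keeps the bins in numeric order, which is usually easier to read than the default ordering by count.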

In [None]:
data_num.plot.hist()
plt.title("Deposit Distribution")
plt.xlabel("Deposit")
plt.ylabel("Frequency")
plt.xticks(rotation = 45)

In [None]:
# Plot a histogram of the data. Be sure to include a title and axis labels.

plt.figure(figsize = (8, 5))

data_bin_4.value_counts().plot.bar()
plt.title("Deposit Distribution")
plt.xlabel("Deposit range")
plt.ylabel("Frequency")
plt.xticks(rotation = 10)

In [None]:
# Review

data_bin_4.value_counts().sort_index(ascending=False).plot(kind='barh')

3. Use pandas to create a Series from the following exam scores:

[60, 86, 75, 62, 93, 71, 60, 83, 95, 78, 65, 72, 69, 81, 96, 80, 85, 92, 82, 78]

In [None]:
scores = pd.Series([60, 86, 75, 62, 93, 71, 60, 83, 95, 78, 65, 72, 69, 81, 96, 80, 85, 92, 82, 78])
scores

In [None]:
# What is the minimum exam score? The max, mean, median?

scores.describe()

In [None]:
# Plot a histogram of the scores.

scores.plot.hist()

In [None]:
# Convert each of the numbers above into a letter grade. For example, 86 should be a 'B' and 95 should be an 'A'.

scores.apply(lambda score: "A" if score >= 88 else ("B" if score >= 80 else ("C" if score >= 67 else ("D" if score >= 60 else "F"))))

In [None]:
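# Define bin edges

bin_edges = [0,70,75,80,90,101]

bin_labels = ["F", "D", "C", "B", "A"]

# right=False makes each bin include its left edge, so 90 is an "A" and 100 falls in [90, 101)
pd.cut(scores, bins=bin_edges, labels=bin_labels, right=False)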
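How scores on a bin edge are graded depends on the `right` parameter of `pd.cut`. A sketch with a few made-up boundary scores, using the same edges and labels as the cell above:

```python
import pandas as pd

# Hypothetical scores chosen to sit on or near the bin edges.
boundary_scores = pd.Series([69, 70, 90, 100])
edges = [0, 70, 75, 80, 90, 101]
labels = ["F", "D", "C", "B", "A"]

# Default right=True: bins are (0, 70], (70, 75], ..., so 70 and 90 fall in the lower bin.
right_closed = pd.cut(boundary_scores, bins=edges, labels=labels)

# right=False: bins are [0, 70), [70, 75), ..., so 70 and 90 fall in the upper bin.
left_closed = pd.cut(boundary_scores, bins=edges, labels=labels, right=False)

print(right_closed.tolist())
print(left_closed.tolist())
```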

In [None]:
# Write the code necessary to implement a curve. I.e. that grade closest to 100 should be converted to a 100, 
# and that many points should be given to every other score as well.

curve_points = (100 - scores).min()
curved_scores = scores + curve_points
curved_scores

scores.plot.hist(alpha = 0.5)
curved_scores.plot.hist(alpha = 0.5)

4. Use pandas to create a Series from the following string:

'hnvidduckkqxwymbimkccexbkmqygkxoyndmcxnwqarhyffsjpsrabtjzsypmzadfavyrnndndvswreauxovncxtwzpwejilzjrmmbbgbyxvjtewqthafnbkqplarokkyydtubbmnexoypulzwfhqvckdpqtpoppzqrmcvhhpwgjwupgzhiofohawytlsiyecuproguy'

In [None]:
original_text = pd.Series('hnvidduckkqxwymbimkccexbkmqygkxoyndmcxnwqarhyffsjpsrabtjzsypmzadfavyrnndndvswreauxovncxtwzpwejilzjrmmbbgbyxvjtewqthafnbkqplarokkyydtubbmnexoypulzwfhqvckdpqtpoppzqrmcvhhpwgjwupgzhiofohawytlsiyecuproguy')
original_text

In [None]:
text = pd.Series([char for char in original_text[0]])
text.describe()

In [None]:
# What is the most frequently occurring letter? Least frequently occurring?

# describe() above already reports the most frequently occurring letter (top).

char_count = text.value_counts()
min_count = char_count.min()

char_count[char_count == min_count]

In [None]:
# How many vowels are in the list?

text.apply(count_vowel).sum()

In [None]:
# How many consonants are in the list?

text.size - text.apply(count_vowel).sum()
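For a Series of single characters, the vowel and consonant counts can also be obtained with `isin`, without a helper function. A minimal sketch using only the first nine letters of the string:

```python
import pandas as pd

# First nine letters of the exercise string, one character per element.
letters = pd.Series(list("hnvidduck"))

is_vowel = letters.isin(list("aeiou"))  # boolean Series, True where the letter is a vowel
n_vowels = is_vowel.sum()
n_consonants = (~is_vowel).sum()        # valid here because every element is a letter

print(n_vowels, n_consonants)
```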

In [None]:
# Create a series that has all of the same letters, but uppercased

# uppercase the per-letter Series so the result has one element per letter
text.str.upper()

In [None]:
# Create a bar plot of the frequencies of the 6 most frequently occurring letters.

six_most = text.value_counts().head(n=6)
six_most.plot.bar()
plt.xticks(rotation = 0)
plt.title("Top 6 most frequently occurring letters")
plt.xlabel("letters")
plt.ylabel("frequency")

5. Complete the exercises from https://gist.github.com/ryanorsinger/f7d7c1dd6a328730c04f3dc5c5c69f3a, but use pandas Series for the data structure instead of lists and use Series subsetting/indexing and vectorization options instead of loops and lists.

In [None]:
fruits = pd.Series(['mango', 'kiwi', 'strawberry', 'guava', 'pineapple', 'mandarin orange'])

numbers = pd.Series([2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 17, 19, 23, 256, -8, -4, -2, 5, -9])

In [None]:
# Exercise 1 - rewrite the above example code using list comprehension syntax. 
# Make a variable named uppercased_fruits to hold the output of the list comprehension. 
# Output should be ['MANGO', 'KIWI', etc...]

uppercased_fruits = fruits.str.upper()
list(uppercased_fruits)

In [None]:
# Exercise 2 - create a variable named capitalized_fruits and use list comprehension syntax 
# to produce output like ['Mango', 'Kiwi', 'Strawberry', etc...]

capitalized_fruits = fruits.str.capitalize()
list(capitalized_fruits)

In [None]:
# Exercise 3 - Use a list comprehension to make a variable named fruits_with_more_than_two_vowels. 
# Hint: You'll need a way to check if something is a vowel.

fruits_with_more_than_two_vowels = fruits[fruits.apply(count_vowel) > 2]
list(fruits_with_more_than_two_vowels)

In [None]:
# Exercise 4 - make a variable named fruits_with_only_two_vowels. 
# The result should be ['mango', 'kiwi', 'strawberry']

fruits_with_only_two_vowels = fruits[fruits.apply(count_vowel) == 2]
list(fruits_with_only_two_vowels)

In [None]:
# Exercise 5 - make a list that contains each fruit with more than 5 characters

list(fruits[fruits.str.len() > 5])

In [None]:
# Exercise 6 - make a list that contains each fruit with exactly 5 characters

list(fruits[fruits.str.len() == 5])

In [None]:
# Exercise 7 - Make a list that contains fruits that have less than 5 characters

list(fruits[fruits.str.len() < 5])

In [None]:
# Exercise 8 - Make a list containing the number of characters in each fruit. 
# Output would be [5, 4, 10, etc... ]

fruits.str.len()
list(fruits.str.len())

In [None]:
# Exercise 9 - Make a variable named fruits_with_letter_a that contains a list of only the fruits 
# that contain the letter "a"


# subset with a boolean mask so that fruits without an "a" are dropped rather than replaced by ""
fruits_with_letter_a = fruits[fruits.str.contains("a")]
fruits_with_letter_a
list(fruits_with_letter_a)

In [None]:
# 10 - Make a variable named even_numbers that holds only the even numbers 

even_numbers = numbers[numbers % 2 == 0]
even_numbers
list(even_numbers)

In [None]:
# 11 - Make a variable named odd_numbers that holds only the odd numbers

odd_numbers = numbers[numbers % 2 == 1]
odd_numbers
list(odd_numbers)
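The test `numbers % 2 == 1` also catches negative odd numbers, because the remainder in pandas (as in Python) takes the sign of the divisor. A small sketch with made-up values:

```python
import pandas as pd

# Hypothetical mix of positive and negative integers.
numbers = pd.Series([5, -9, -4, 8])

remainders = numbers % 2        # never negative when the divisor is positive
odd = numbers[remainders == 1]

print(remainders.tolist())
print(odd.tolist())
```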

In [None]:
#  12 - Make a variable named positive_numbers that holds only the positive numbers

positive_numbers = numbers[numbers > 0]
positive_numbers
list(positive_numbers)

In [None]:
# 13 - Make a variable named negative_numbers that holds only the negative numbers

negative_numbers = numbers[numbers < 0]
negative_numbers
list(negative_numbers)

In [None]:
# 14 - use a list comprehension w/ a conditional in order to produce a list of numbers with 2 or more numerals

numbers[(numbers >= 10) | (numbers <= -10)]
list(numbers[(numbers >= 10) | (numbers <= -10)])

In [None]:
# Exercise 15 - Make a variable named numbers_squared that contains the numbers list with each element squared.
# Output is [4, 9, 16, etc...]

numbers_squared = numbers**2
numbers_squared
print(list(numbers_squared))

In [None]:
# Exercise 16 - Make a variable named odd_negative_numbers that contains only the numbers 
# that are both odd and negative.

odd_negative_numbers = numbers[(numbers < 0) & (numbers % 2 == 1)]
odd_negative_numbers
list(odd_negative_numbers)

In [None]:
# Exercise 17 - Make a variable named numbers_plus_5. In it, return a list containing each number plus five. 

numbers_plus_5 = numbers + 5
numbers_plus_5
list(numbers_plus_5)

In [None]:
# BONUS Make a variable named "primes" that is a list containing the prime numbers in the numbers list. 
# *Hint* you may want to make or find a helper function that determines if a given number is prime or not.

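def check_prime(x):
    # Primes are integers greater than 1, so 1, 0 and negative numbers are not prime
    if x <= 1:
        return False
    # x is prime if no integer from 2 to x - 1 divides it evenly
    for i in range(2, x):
        if (x % i) == 0:
            return False
    return True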
    
mask = numbers.apply(check_prime)
numbers[mask]
list(numbers[mask])
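For larger numbers, trial division only needs to test divisors up to the square root. A sketch of that variant, where the helper name `is_prime` and the sample values are illustrative and not part of the exercise:

```python
import math

import pandas as pd


def is_prime(n):
    """Return True if n is a prime number (hypothetical helper name)."""
    if n < 2:
        return False
    # A composite n always has a divisor no larger than its integer square root.
    for divisor in range(2, math.isqrt(n) + 1):
        if n % divisor == 0:
            return False
    return True


# Hypothetical sample including edge cases: 1, a negative number, and a perfect square.
numbers = pd.Series([2, 3, 4, 9, 13, 256, -8, 1])
primes = numbers[numbers.apply(is_prime)]

print(list(primes))
```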