# Pandas Series Exercises

## Use pandas to create a Series from the following data:

```python
["kiwi", "mango", "strawberry", "pineapple", "gala apple", "honeycrisp apple", "tomato", "watermelon", "honeydew", "kiwi", "kiwi", "kiwi", "mango", "blueberry", "blackberry", "gooseberry", "papaya"]
```

In [None]:
import pandas as pd #Convention is to import the pandas module with the alias pd.

In [None]:
fruits = pd.Series(["kiwi", "mango", "strawberry", "pineapple", "gala apple", "honeycrisp apple", "tomato", "watermelon", "honeydew", "kiwi", "kiwi", "kiwi", "mango", "blueberry", "blackberry", "gooseberry", "papaya"])
type(fruits)
# By using the pd.Series(data, index, dtype, name, copy) function, we turn our list of strings into a pandas Series
# data = our list
# index is not provided, so it defaults to a RangeIndex running from 0 to len(data) - 1
# dtype is not provided, so it is inferred from the data. Because the data contains strings, the dtype will be object
# copy is not provided. It only affects Series or NumPy array input; a Python list is always copied into new storage, so changing fruits does not change the original list
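
The constructor arguments described in the comments above can be seen in a small sketch. The three-item data, the labels, and the names `sample` and `default_index` are made up for illustration and are not part of the exercise.

```python
import pandas as pd

# Hypothetical three-item sample, used only to illustrate the arguments.
sample = pd.Series(
    ["kiwi", "mango", "papaya"],
    index=["a", "b", "c"],   # custom labels instead of the default 0, 1, 2
    dtype="object",          # stated explicitly rather than inferred
    name="sample_fruits",
)

print(sample.index.tolist())   # ['a', 'b', 'c']
print(sample.loc["b"])         # mango

# Without an index argument, the labels run from 0 to len(data) - 1.
default_index = pd.Series(["kiwi", "mango", "papaya"]).index.tolist()
print(default_index)           # [0, 1, 2]
```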

In [None]:
fruits.name = 'fruits' #Just for fun, we set the name of this series to 'fruits'
fruits 

#### Run `.describe()` on the series to see what describe returns for a series of strings.

In [None]:
fruits.describe() #Because the series dtype is object, it will provide us with the following
#count - the number of elements in the series
#unique - the number of unique values (count - unique = number of repeated values)
#top - the most common value
#freq - the frequency for the most common value. Kiwi is found in the series four times
#Note: If multiple values are tied for the highest count, then the top and freq results will be arbitrarily chosen from among those tied values

#### Run the code necessary to produce only the unique fruit names.

In [None]:
fruits.unique() # Returns the unique values as a NumPy array.
# Uniques are returned in order of appearance.

#### Determine how many times each value occurs in the series.

In [None]:
freq = fruits.value_counts() #value_counts creates a series whose index holds the unique values of the original series and whose values are their frequencies, sorted from most to least frequent
print(freq)

#### Determine the most frequently occurring fruit name from the series.

In [None]:
fruits.mode().iloc[0] #fruits.mode() returns a series of the value(s) with the highest frequency in the original series. Here there is only one, at index 0; a tie would produce several.
#by using .iloc[] and passing in a position of 0, we can obtain the first mode as a simple string

#### Determine the least frequently occurring fruit name from the series.

In [None]:
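# We can create a function that will print out all of the least frequently occurring fruits
def least_frequent(series):
    freq = series.value_counts()  # use the argument, not the global; sorted from most to least frequent
    for i in range(len(freq)):
        if freq.iloc[i] == freq.iloc[-1]:  # .iloc selects by position; the last entry has the lowest count
            print(freq.index[i])

least_frequent(fruits)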

In [None]:
# Or we can use .nsmallest() to select only the smallest values
fruits.value_counts().nsmallest(n = 1, keep = 'all')
# .nsmallest(n, keep= )
# n is the number of smallest values we wish to keep. It defaults to 5, so we pass n = 1 to get only the lowest count
# keep = 'first', 'last', 'all' - These handle how duplicate values might be selected
# first = return the first n occurrences in order of appearance.
# last = return the last n occurrences in reverse order of appearance.
# all = keep all occurrences. This can result in a Series of size larger than n.
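
To see how `n` and `keep` interact, here is a minimal sketch on a shorter, made-up series (the data and the names `counts`, `least`, and `least_names` are assumptions for illustration). With `n=1` and `keep='all'`, every value tied for the lowest count comes back.

```python
import pandas as pd

# Hypothetical short series with two values tied for least frequent.
sample = pd.Series(["kiwi", "mango", "kiwi", "papaya", "mango", "kiwi", "tomato"])
counts = sample.value_counts()

# n=1 asks for the single smallest count; keep='all' widens the result
# to include every value that shares that count.
least = counts.nsmallest(1, keep="all")
least_names = sorted(least.index)
print(least_names)  # ['papaya', 'tomato']

# With the default n=5 and only four distinct values, everything is returned.
print(len(counts.nsmallest(keep="all")))  # 4
```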

#### Write the code to get the longest string from the fruits series.

In [None]:
string_lengths = fruits.str.len() #series.str.len() returns a series where each value is the length of the corresponding string in the original series
string_lengths

In [None]:
fruits.str.len().max()

In [None]:
fruits_mask = fruits.str.len() == fruits.str.len().max()
fruits_mask

In [None]:
fruits[fruits_mask]

In [None]:
string_lengths.idxmax() # We can use .idxmax() on our series to find the index label of the first occurrence of the highest value

In [None]:
fruits[string_lengths.idxmax()] # Then we can pass that index label to our original series

In [None]:
longest_string = max(fruits, key=len) #We can also use the built-in max function on our series. Passing key=len tells Python to compare the strings by their lengths
longest_string

In [None]:
# While these methods work, idxmax() and max() return only the first match, so they miss ties for the longest string.
# The boolean mask above does capture ties, and so does .nlargest() with keep='all'.
string_lengths.nlargest(1, keep='all')
# .nlargest(1, keep='all') is a good start, as it produces output that captures ties
# We can then pass the indices of that output to our original series and identify the strings that are tied for the longest
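
The last step described in the comments above, passing the indices back to the original series, might look like the following sketch. The series `words` is made-up data chosen so that two strings tie for the longest.

```python
import pandas as pd

# Hypothetical data: 'banana' and 'cherry' tie at six characters.
words = pd.Series(["fig", "banana", "cherry", "kiwi"])
lengths = words.str.len()

# Keep every entry tied for the largest length.
tied = lengths.nlargest(1, keep="all")

# Use the index labels of the tied entries to look up the strings themselves.
longest = words[tied.index]
print(sorted(longest))  # ['banana', 'cherry']
```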

#### Find the fruit(s) with 5 or more letters in the name.

In [None]:
fruits.str.len() >= 5 # We can find an applicable boolean mask by using series.str.len() with a comparison operator (in this case, >= )

In [None]:
fruits[fruits.str.len() >= 5] # We can pass that boolean mask to the original series and only output values that meet the condition

#### Capitalize all the fruit strings in the series.

In [None]:
fruits.str.capitalize() #series.str.capitalize() does this simply enough

#### Count the letter "a" in all the fruits (use string vectorization)

In [None]:
fruits.str.count('a') #series.str.count() takes a pattern (treated as a regular expression) and counts its occurrences in each string

#### Output the number of vowels in each and every fruit.

In [None]:
fruits.str.lower().str.count(r'[aeiou]')

In [None]:
fruits.str.lower().str.count('[aeiou]')

#### Use the .apply method and a lambda function to find the fruit(s) containing two or more "o" letters in the name.

In [None]:
fruits[fruits.apply(lambda fruit: fruit.count('o') >= 2)]

#### Write the code to get only the fruits containing "berry" in the name

In [None]:
fruits[fruits.apply(lambda fruit: fruit.count('berry') > 0)]

#### Write the code to get only the fruits containing "apple" in the name

In [None]:
fruits[fruits.apply(lambda fruit: fruit.count('apple') > 0)]
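
For the two substring exercises above, the vectorized `.str.contains()` method is an alternative to `.apply` with a lambda. A minimal sketch on a shortened, made-up version of the fruit list:

```python
import pandas as pd

# Hypothetical shortened fruit list, for illustration only.
sample = pd.Series(["kiwi", "strawberry", "gala apple", "blueberry", "pineapple"])

# regex=False treats the pattern as a literal substring.
berries = sample[sample.str.contains("berry", regex=False)]
apples = sample[sample.str.contains("apple", regex=False)]

print(berries.tolist())  # ['strawberry', 'blueberry']
print(apples.tolist())   # ['gala apple', 'pineapple']
```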

#### Which fruit has the highest amount of vowels?

In [None]:
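fruits[fruits.str.lower().str.count('[aeiou]').idxmax()] # idxmax() gives the index label of the highest vowel count, which we then use to look up the fruit. Passing the count itself as an index would only work by coincidence.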

## Use pandas to create a Series from the following data:

```python
['$796,459.41', '$278.60', '$482,571.67', '$4,503,915.98', '$2,121,418.3', '$1,260,813.3', '$87,231.01', '$1,509,175.45', '$4,138,548.00', '$2,848,913.80', '$594,715.39', '$4,789,988.17', '$4,513,644.5', '$3,191,059.97', '$1,758,712.24', '$4,338,283.54', '$4,738,303.38', '$2,791,759.67', '$769,681.94', '$452,650.23']
```

In [None]:
money = pd.Series(['$796,459.41', '$278.60', '$482,571.67', '$4,503,915.98', '$2,121,418.3', '$1,260,813.3', '$87,231.01', '$1,509,175.45', '$4,138,548.00', '$2,848,913.80', '$594,715.39', '$4,789,988.17', '$4,513,644.5', '$3,191,059.97', '$1,758,712.24', '$4,338,283.54', '$4,738,303.38', '$2,791,759.67', '$769,681.94', '$452,650.23'])

In [None]:
money

#### What is the data type of the series?

In [None]:
money.dtype
# 'O' = object

#### Use series operations to convert the series to a numeric data type.

In [None]:
money_clean = money.str.replace('$', '', regex=False) # regex=False matches '$' literally; as a regular expression, '$' means end of string
money_clean = money_clean.str.replace(',', '', regex=False)
money_clean = money_clean.astype('float')
money_clean
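
Why literal matching matters for the dollar sign: in a regular expression, `$` is an anchor for the end of the string, not a character. I believe pandas has handled single-character patterns differently across versions when `regex=True`, so this sketch shows the underlying behaviour with Python's own `re` module and then the literal approach. The `prices` values are made up for illustration.

```python
import re
import pandas as pd

prices = pd.Series(["$1,250.50", "$99.00"])  # hypothetical sample values

# As a regular expression, '$' matches the empty position at the end of the
# string, so substituting it removes nothing.
unchanged = re.sub("$", "", "$99.00")
print(unchanged)  # $99.00

# Literal matching removes the symbols themselves.
cleaned = (
    prices.str.replace("$", "", regex=False)
          .str.replace(",", "", regex=False)
          .astype(float)
)
print(cleaned.tolist())  # [1250.5, 99.0]
```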

#### What is the maximum value? The minimum?

In [None]:
print(f"The maximum value was {money_clean.max()}.")
print(f"The minimum value was {money_clean.min()}.")

#### Bin the data into 4 equally sized intervals and show how many values fall into each bin.

In [None]:
# pd.cut splits the range of the data into 4 intervals of equal width
bins = pd.cut(money_clean, 4)
print(bins)
print(bins.value_counts())

In [None]:
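# pd.qcut bins by quantile instead: the intervals differ in width but hold roughly equal numbers of values
q = pd.qcut(money_clean, 4)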
qcount = q.value_counts()
print(q)
print(qcount)
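
The two cells above use `pd.cut` and `pd.qcut`, which answer different questions. A small sketch on made-up, skewed data shows the contrast: `pd.cut` produces equal-width intervals, while `pd.qcut` produces intervals holding equal numbers of values.

```python
import pandas as pd

# Hypothetical skewed data: seven small values and one large outlier.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# Equal-width bins: the outlier stretches the range, so most bins are empty.
equal_width = pd.cut(values, 4).value_counts(sort=False)

# Quantile bins: edges are chosen so each bin holds the same number of values.
equal_count = pd.qcut(values, 4).value_counts(sort=False)

print(sorted(equal_width.tolist()))  # [0, 0, 1, 7]
print(equal_count.tolist())          # [2, 2, 2, 2]
```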

#### Plot a histogram of the data. Be sure to include a title and axis labels.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
plt.title('Distribution of Currency Values')
plt.xlabel('Amount')
money_clean.plot.hist(grid = True)

## Use pandas to create a Series from the following exam scores:

```python
[60, 86, 75, 62, 93, 71, 60, 83, 95, 78, 65, 72, 69, 81, 96, 80, 85, 92, 82, 78]
```

In [None]:
scores = pd.Series([60, 86, 75, 62, 93, 71, 60, 83, 95, 78, 65, 72, 69, 81, 96, 80, 85, 92, 82, 78])

In [None]:
print(f"The lowest score was {scores.min()}.")
print(f"The highest score was {scores.max()}.")

#### Plot a histogram of the scores.

In [None]:
plt.title('Distribution of Grades')
plt.xlabel('Grade')
plt.ylabel('Frequency')
scores.hist()

In [None]:
plt.title('Distribution of Grades')
plt.xlabel('Grade')
plt.ylabel('Frequency')
scores.hist(bins = [50, 60, 70, 80, 90, 100],rwidth = .8)

#### Convert each of the numbers above into a letter grade. For example, 86 should be 'B' and 95 should be 'A'.

In [None]:
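# Assuming the usual scale where 80-89 is a 'B' and 90-100 is an 'A': right = False makes each bin include its lower edge, and the top edge is 101 so that 100 is still included
scores_to_letter = pd.cut(scores, [0, 60, 70, 80, 90, 101], labels = ["F", "D", "C", "B", "A"], right = False)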
scores_to_letter
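
How `pd.cut` treats a score that sits exactly on a bin edge depends on its `right` argument. The sketch below compares the two settings on made-up boundary scores; which one is correct depends on the grading scale, which the exercise does not spell out.

```python
import pandas as pd

boundary_scores = pd.Series([60, 70, 80, 90, 100])  # hypothetical edge cases
labels = ["F", "D", "C", "B", "A"]

# Default right=True: intervals are (0, 60], (60, 70], ... so 80 lands in 'C'.
right_closed = pd.cut(boundary_scores, [0, 60, 70, 80, 90, 100], labels=labels)

# right=False: intervals are [0, 60), [60, 70), ... so 80 lands in 'B'.
# The top edge is 101 so that a score of 100 is still included.
left_closed = pd.cut(
    boundary_scores, [0, 60, 70, 80, 90, 101], labels=labels, right=False
)

print(right_closed.tolist())  # ['F', 'D', 'C', 'B', 'A']
print(left_closed.tolist())   # ['D', 'C', 'B', 'A', 'A']
```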

#### Write the code necessary to implement a curve. I.e. the grade closest to 100 should be converted to a 100, and that many points should be given to every other score as well.

In [None]:
curve = 100 - scores.max()
scores_curved = scores + curve
scores_curved

## Use pandas to create a Series from the following string:

```python
'hnvidduckkqxwymbimkccexbkmqygkxoyndmcxnwqarhyffsjpsrabtjzsypmzadfavyrnndndvswreauxovncxtwzpwejilzjrmmbbgbyxvjtewqthafnbkqplarokkyydtubbmnexoypulzwfhqvckdpqtpoppzqrmcvhhpwgjwupgzhiofohawytlsiyecuproguy'
```

#### What is the most frequently occurring letter?

In [None]:
letters = list('hnvidduckkqxwymbimkccexbkmqygkxoyndmcxnwqarhyffsjpsrabtjzsypmzadfavyrnndndvswreauxovncxtwzpwejilzjrmmbbgbyxvjtewqthafnbkqplarokkyydtubbmnexoypulzwfhqvckdpqtpoppzqrmcvhhpwgjwupgzhiofohawytlsiyecuproguy')

In [None]:
letters = pd.Series(letters)

In [None]:
letters_counts = letters.value_counts()
letters_counts.iloc[[0]]

In [None]:
top = letters_counts.head(1)
value, count = top.index[0], top.iat[0]
print(value)
print(count)

#### Least frequently occurring letter?

In [None]:
letters_counts.iloc[[-1]]

In [None]:
bottom = letters_counts.tail(1)
least_value, least_count = bottom.index[0], bottom.iat[0]
print(least_value)
print(least_count)

#### How many vowels are in the list?

In [None]:
sum(letters.str.lower().str.count('[aeiou]'))

#### How many consonants are in the list?

In [None]:
sum(letters.str.lower().str.count('[^aeiou]'))

In [None]:
sum(letters.str.lower().str.count('[bcdfghjklmnpqrstvwxyz]'))

#### Create a series that has all of the same letters, but uppercased

In [None]:
letters_upper = letters.str.upper()
letters_upper

#### Create a bar plot of the frequencies of the 6 most frequently occurring letters.

In [None]:
top_6 = letters_counts.nlargest(6)
top_6

In [None]:
top_6_bar = top_6.plot.bar(rot=0, title="6 Most Frequently Occurring Letters")
top_6_bar.set_ylabel('Frequency')
top_6_bar.set_xlabel('Letters')
plt.show()

## Complete the exercises from https://gist.github.com/ryanorsinger/f7d7c1dd6a328730c04f3dc5c5c69f3a, but use pandas Series for the data structure instead of lists and use Series subsetting/indexing and vectorization options instead of loops and lists.

In [None]:
fruits2 = pd.Series(['mango', 'kiwi', 'strawberry', 'guava', 'pineapple', 'mandarin orange'])

numbers2 = pd.Series([2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 17, 19, 23, 256, -8, -4, -2, 5, -9])

#### Exercise 1 - Make a variable named uppercased_fruits to hold the series. Output should be ['MANGO', 'KIWI', etc...]

In [None]:
uppercased_fruits = fruits2.str.upper()
uppercased_fruits

#### Exercise 2 - create a variable named capitalized_fruits and use vectorization options to produce output like ['Mango', 'Kiwi', 'Strawberry', etc...]

In [None]:
capitalized_fruits = fruits2.str.capitalize()
capitalized_fruits

#### Exercise 3 - Use vectorization options to make a variable named fruits_with_more_than_two_vowels. Hint: You'll need a way to check if something is a vowel.

In [None]:
from collections import Counter
mask = fruits2.map(lambda c: sum([Counter(c.lower()).get(i, 0) for i in list('aeiou')]) > 2)
fruits_with_more_than_two_vowels = fruits2[mask]
fruits_with_more_than_two_vowels

In [None]:
mask = fruits2.str.lower().str.count('[aeiou]') > 2
fruits_with_more_than_two_vowels = fruits2[mask]
fruits_with_more_than_two_vowels

#### Exercise 4 - make a variable named fruits_with_only_two_vowels. The result should be ['mango', 'kiwi', 'strawberry']

In [None]:
mask = fruits2.map(lambda c: sum([Counter(c.lower()).get(i, 0) for i in list('aeiou')]) == 2)
fruits_with_only_two_vowels = fruits2[mask]
fruits_with_only_two_vowels

In [None]:
mask = fruits2.str.lower().str.count('[aeiou]') == 2
fruits_with_only_two_vowels = fruits2[mask]
fruits_with_only_two_vowels

#### Exercise 5 - make a series that contains each fruit with more than 5 characters

In [None]:
fruits2[fruits2.str.len() > 5]

#### Exercise 6 - make a series that contains each fruit with exactly 5 characters

In [None]:
fruits2[fruits2.str.len() == 5]

#### Exercise 7 - Make a series that contains fruits that have less than 5 characters

In [None]:
fruits2[fruits2.str.len() < 5]

#### Exercise 8 - Make a series containing the number of characters in each fruit. Output would look like [5, 4, 10, 5, 9, 15]

```python
fruits2 = pd.Series(['mango', 'kiwi', 'strawberry', 'guava', 'pineapple', 'mandarin orange'])
```

In [128]:
fruits2.str.len()

0     5
1     4
2    10
3     5
4     9
5    15
dtype: int64

In [129]:
type(fruits2.str.len())

pandas.core.series.Series

In [130]:
list(fruits2.str.len())

[5, 4, 10, 5, 9, 15]

In [131]:
type(list(fruits2.str.len()))

list

In [134]:
example = pd.Series([52,4,3])
print(example)
#example.str.len()

0    52
1     4
2     3
dtype: int64


In [136]:
example = pd.Series([52,4,3])
print(example)
#example.astype(str).str.len()

0    52
1     4
2     3
dtype: int64


In [138]:
example = pd.Series(['52', 4, 3, '52'])
print(example)
example.str.len()

0    52
1     4
2     3
3    52
dtype: object


0    2.0
1    NaN
2    NaN
3    2.0
dtype: float64
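
The output above shows `.str.len()` returning `NaN` for the elements that are not strings. One way around this, hinted at by the commented-out line in the earlier cell, is to convert every element to a string first. A minimal sketch, with `digit_counts` as a made-up name:

```python
import pandas as pd

example = pd.Series(['52', 4, 3, '52'])

# astype(str) converts the integers to strings, so every element has a length.
digit_counts = example.astype(str).str.len()
print(digit_counts.tolist())  # [2, 1, 1, 2]
```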

#### Exercise 9 - Make a variable named fruits_with_letter_a that contains a series of only the fruits that contain the letter "a"

In [None]:
fruits_with_letter_a = fruits2[fruits2.str.count('a') > 0]
fruits_with_letter_a

#### Exercise 10 - Make a variable named even_numbers that holds only the even numbers 

In [None]:
even_numbers = numbers2[numbers2 % 2 == 0]
even_numbers

#### Exercise 11 - Make a variable named odd_numbers that holds only the odd numbers

In [None]:
odd_numbers = numbers2[numbers2 % 2 == 1]
odd_numbers

#### Exercise 12 - Make a variable named positive_numbers that holds only the positive numbers

In [None]:
positive_numbers = numbers2[numbers2 > 0]
positive_numbers

#### Exercise 13 - Make a variable named negative_numbers that holds only the negative numbers

In [None]:
negative_numbers = numbers2[numbers2 < 0]
negative_numbers

#### Exercise 14 - use vectorized operations in order to produce a series of numbers with 2 or more numerals

In [None]:
numbers_with_2_or_more_numerals = numbers2[(numbers2 >= 10) | (numbers2 <= -10)]
numbers_with_2_or_more_numerals

#### Exercise 15 - Make a variable named numbers_squared that contains the numbers list with each element squared. Output is [4, 9, 16, etc...]

In [None]:
numbers_squared = numbers2 ** 2
numbers_squared

#### Exercise 16 - Make a variable named odd_negative_numbers that contains only the numbers that are both odd and negative.



In [None]:
odd_negative_numbers = numbers2[(numbers2 % 2 == 1) & (numbers2 < 0)]
odd_negative_numbers

#### Exercise 17 - Make a variable named numbers_plus_5. In it, return a series containing each number plus five. 

In [None]:
numbers_plus_5 = numbers2 + 5
numbers_plus_5

#### BONUS: Make a variable named "primes" that is a series containing the prime numbers in the numbers list. *Hint* you may want to make or find a helper function that determines if a given number is prime or not.

In [None]:
def prime_number_detector(num):
    if num > 1:
        if num == 2:
            return True
        for i in range(2, num):
            if (num % i) == 0:
                return False 
        else: 
            return True
    else: 
        return False
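
The helper above tests every candidate divisor below `num`. A common shortcut, sketched here as an alternative rather than as the notebook's method, stops at the integer square root, since any larger divisor would pair with a smaller one already tested. The function name `is_prime` and the sample numbers are made up for illustration.

```python
import math
import pandas as pd

def is_prime(num):
    """Return True if num is prime, testing divisors up to sqrt(num)."""
    if num < 2:
        return False
    for i in range(2, math.isqrt(num) + 1):
        if num % i == 0:
            return False
    return True

# Hypothetical sample that includes edge cases: 1, a negative, and a large composite.
sample_numbers = pd.Series([2, 3, 4, 9, 17, 256, -8, 1])
sample_primes = sample_numbers[sample_numbers.apply(is_prime)]
print(sample_primes.tolist())  # [2, 3, 17]
```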

In [None]:
mask = numbers2.apply(prime_number_detector)
primes = numbers2[mask]
primes

In [None]:
primes = numbers2[numbers2.apply(lambda n: all(n % i != 0 for i in range(2, n)) and n > 1)]
primes
#n > 1 rather than n > 0, because all() returns True for an empty range and 1 would otherwise count as prime
#The following cells tinker with changes to the lambda function

In [None]:
primes = numbers2[numbers2.apply(lambda n: [True, False, False] and n > 0)]
primes

In [None]:
primes = numbers2[numbers2.apply(lambda n: tuple(n % i != 0 for i in range(2, n)) and n > 0)]
primes

In [None]:
x = lambda n: (n % i != 0 for i in range(2, n))
list(x(4))

In [None]:
primes = numbers2[numbers2.apply(lambda n: True and n > 0)]
primes

In [None]:
primes = numbers2[numbers2.apply(lambda n: (0) and n > 0)]
primes

In [None]:
# The expression (n % i != 0 for i in range(2, n)) inside the lambda is a generator
# expression: it produces one boolean per candidate divisor, on demand.
# all() consumes the generator directly, so there is no need to typecast it to a list
# or a tuple first. It returns True only if every value is True, meaning that no
# candidate divided n evenly. It stops at the first False, and it returns True for an
# empty generator. That single boolean is what the lambda needs for the mask.
# Replacing all() with list() or tuple() breaks this, because `and` returns one of its
# operands rather than a bool: `x and y` gives x when x is falsy, otherwise y.
# A non-empty list or tuple is truthy no matter what it contains, so for n > 2 the
# lambda just returns n > 0. An empty list or tuple is falsy, so for n <= 2 the lambda
# returns [] or () itself. The result of .apply() is then a mix of booleans and empty
# containers instead of a clean boolean mask.
# With list(), I got TypeError: unhashable type: 'list'. With tuple(), which is
# hashable, I got output, but it was not the primes.
# In summary, the all() function does a great job of turning our loop into a single
# boolean value per number that pandas can use as a mask.
# I have not traced exactly where pandas raises the error for lists.
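
The behaviour of `all()` and `and` that the comment above wrestles with can be checked in plain Python, without pandas. This is a minimal sketch; the function name `no_divisors` is made up for illustration.

```python
def no_divisors(n):
    """Return True when no integer in range(2, n) divides n evenly."""
    # all() consumes the generator expression directly and stops at the first False.
    return all(n % i != 0 for i in range(2, n))

print(no_divisors(7))   # True
print(no_divisors(9))   # False
print(no_divisors(1))   # True: all() of an empty iterable is True

# `and` returns one of its operands, not necessarily a bool.
empty_case = () and True               # the empty tuple is falsy, so it is returned
nonempty_case = (False, False) and True  # a non-empty tuple is truthy, so True is returned
print(empty_case)     # ()
print(nonempty_case)  # True
```

This is why the `all(...)` version needs the extra `n > 1` style guard, and why swapping in `tuple(...)` produces values that are not booleans.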

# END