Pull request #26

pdxrod · 2021-07-01T10:57:48Z

I'm trying to do a pull request for some files which I've added to this project. They are the Python files Chapter..N...py broken down into smaller files to make them easier to read. I couldn't see how to do a pull request unless I had write access to this repo, so I cloned, and created my own, at https://github.com/pdxrod/practical-statistics-for-data-scientists. I'll delete this repo if requested to do so by Peter Gedeck.

The main purpose of this branch (small-files) was to make it easier for me to read the book and understand it, being able to see the code in smaller sections, whereas the Chapter..N...py files are 395 lines on average.

gedeck · 2021-07-01T12:01:31Z

Thank you for the contribution. I'm reluctant to add it to the repository in its current form. However, we may come up with a way to do it. I have a second, private repository in addition to the public one. The private repository contains only the notebooks, plus some additional code to create the figure files that were used for the book. Whenever I make changes to the code, I modify the private repository and then run a script that takes each notebook, strips out the book-specific code, creates notebooks and code files, and runs each of them. On success, the files are copied to the public repository. This has the advantage that I only need to update one file and can create all of the others automatically. This is why I would like to keep the R and Python directories under my full control.

Coming back to your suggestion. What we could do is have a contrib directory where we add code contributed by the community. This could be something like your contribution or variations of the code using different packages (e.g. ggplot and not base-R plotting, or building models in scikit-learn using pipelines). I would not take responsibility for maintaining the code in this directory.

What do you think of this?

pdxrod · 2021-07-01T12:53:57Z

Hi Peter - Thanks for your rapid reply. Yes, adding a contrib/ directory to your current code sounds like a good idea. If I have write access to it, I'll put my branch in it, and delete it from my pdxrod GitHub. And I'll be asking you questions about the book as I go through it - slowly, this time. Rod McLaughlin


gedeck · 2021-07-01T13:40:14Z

There is no need to have write access. Instead of creating a new repository, you fork this one and make changes in the forked repository. You can then create a pull request from your forked repository into mine. Here is a screenshot that should explain how the fork can create a pull request to the original repository: https://user-images.githubusercontent.com/8720575/124130928-48951980-da4d-11eb-86b0-41f6bb1a43c4.png

pdxrod · 2021-07-31T12:53:35Z

Peter - You probably already know this, but line 77 of Chapter 3 - Statistial Experiments and Significance Testing.py

print(np.mean(perm_diffs > mean_b - mean_a))

should be

print(np.mean(perm_diffs) > mean_b - mean_a)

There's a typo in the filename too ;)


gedeck · 2021-07-31T15:55:16Z

The command

print(np.mean(perm_diffs > mean_b - mean_a))

is correct. It probably warrants some explanation. perm_diffs is a vector of possible differences of means for A and B. mean_b - mean_a is a number, the actual difference between the means of A and B.

perm_diffs > mean_b - mean_a is a boolean vector of the same length as perm_diffs. Each element is True where the corresponding element of perm_diffs is greater than the actual difference of the means, and False otherwise, e.g.

[True, True, False, ...., False, True]

Python and R (and a lot of other languages) interpret True as 1 and False as 0. Calculating the mean of this vector therefore gives the fraction of True values. In the book this is 0.121. This is what we want to know.

print(np.mean(perm_diffs) > mean_b - mean_a)

on the other hand will print either True or False.
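A minimal sketch of the difference between the two expressions. The numbers are made up for illustration and are not the book's data; `observed_diff` is a hypothetical name standing in for mean_b - mean_a.

```python
import numpy as np

# Hypothetical permutation differences; not the book's data.
perm_diffs = np.array([0.5, -1.2, 3.4, 0.1, 2.8])
observed_diff = 1.0  # stands in for mean_b - mean_a

# Element-wise comparison yields a boolean array of the same length.
exceeds = perm_diffs > observed_diff
print(exceeds.tolist())   # [False, False, True, False, True]

# True counts as 1 and False as 0, so the mean is the fraction of True values.
p_value = np.mean(exceeds)
print(p_value)            # 0.4

# With the parenthesis moved, a single number (the mean of perm_diffs)
# is compared with observed_diff, giving one boolean.
single_flag = bool(np.mean(perm_diffs) > observed_diff)
print(single_flag)        # True
```

Here two of the five values exceed 1.0, so the first expression gives 0.4, while the second only reports whether the average of all five values is above 1.0.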

Thanks for spotting the typo in the filename. It is now corrected.

pdxrod · 2021-07-31T16:15:44Z

Hmm... Wrong Python? Wrong numpy? Thanks for your rapid responses.

$ python3 --version
Python 3.8.5

>>> numpy.version.version
'1.19.5'

$ python3 Chapter\ 3\ -\ Statistial\ Experiments\ and\ Significance\ Testing.py
...
Traceback (most recent call last):
  File "Chapter 3 - Statistial Experiments and Significance Testing.py", line 77, in <module>
    print(np.mean(perm_diffs > mean_b - mean_a))
TypeError: '>' not supported between instances of 'list' and 'float'


gedeck · 2021-07-31T16:49:24Z

This issue was reported before in #23, but the reporter never got back to me with versions. The problem is that mean_a and mean_b are float and not numpy.float64. The means come from pandas, so it must be an inconsistency with that version. Can you send your pandas version?
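The type distinction can be reproduced in isolation. The sketch below assumes, as the traceback suggests, that perm_diffs is a plain Python list; the values are made up. It relies on Python trying the reflected comparison on the right-hand operand when the list's own comparison is not implemented, which NumPy scalars handle by broadcasting. This is an illustration of that mechanism, not a diagnosis of any particular pandas version.

```python
import numpy as np

perm_diffs = [0.5, -1.2, 3.4]   # plain list, hypothetical values

# Built-in float: neither list nor float knows how to order against
# the other, so Python raises TypeError.
try:
    perm_diffs > 1.0
    raised = False
except TypeError:
    raised = True
print(raised)   # True

# numpy.float64: the list's comparison is not implemented, so Python
# falls back to the reflected comparison on the NumPy scalar, which
# converts the list to an array and compares element-wise.
result = perm_diffs > np.float64(1.0)
print(type(result).__name__, result.tolist())   # ndarray [False, False, True]
```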

pdxrod · 2021-08-01T04:55:05Z

>>> pandas.__version__
'1.1.3'


gedeck · 2021-08-01T17:12:52Z

My versions are: Python 3.9.4, numpy 1.20.2, and pandas 1.2.4

I looked at the various pandas release notes since 1.1.3 but couldn't pinpoint when it was fixed. There are several fixes related to regressions in type casting and it's likely that this was working before 1.1.3 and fixed again after.

I suggest you update pandas to a newer version.

pdxrod · 2021-08-02T04:13:24Z

You do need Python 3.9 to make that line work. What version of scipy do you have? I can't get pip to install it using Python 3.9.


gedeck · 2021-08-02T12:31:23Z

The scipy version that I use is scipy==1.7.0.

I just downgraded my pandas and numpy versions to yours and the code still works. It could be an OS-related issue. I can run the code on macOS and Linux, but I don't have Windows to try it.

pdxrod · 2021-08-02T16:29:34Z

On my Mac M1, pip under Python 3.9 wouldn't install any version of scipy. Since it is just a one-line issue, I solved it in https://github.com/pdxrod/practical-statistics-for-data-scientists/blob/master/python/code/ch_3_01_resampling.py like this:

def make_boolean_array_of_perm_diffs( perm_diffs, b_minus_a ):
    arr = []
    for diff in perm_diffs:
        arr.append( diff > b_minus_a )
    return arr

boolean_array = make_boolean_array_of_perm_diffs( perm_diffs, mean_b - mean_a )
print( np.mean( boolean_array ))


gedeck · 2021-08-02T17:01:51Z

Did you try:

print(np.mean(np.array(perm_diffs) > mean_b - mean_a))
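Converting the list to an array first makes the comparison independent of whether the right-hand side is a built-in float or a NumPy scalar. A short sketch with made-up values follows; `fraction_greater` is a hypothetical helper name, and it uses np.asarray rather than np.array so that an input that is already an array is not copied.

```python
import numpy as np

def fraction_greater(perm_diffs, observed_diff):
    """Return the fraction of perm_diffs strictly greater than observed_diff."""
    # np.asarray accepts lists and arrays alike.
    return float(np.mean(np.asarray(perm_diffs) > observed_diff))

perm_diffs = [0.5, -1.2, 3.4, 0.1, 2.8]   # plain list, hypothetical values
observed_diff = 1.0                        # built-in float

# Same result as comparing element by element in an explicit loop.
loop_result = float(np.mean([d > observed_diff for d in perm_diffs]))

print(fraction_greater(perm_diffs, observed_diff))   # 0.4
print(loop_result == fraction_greater(perm_diffs, observed_diff))   # True
```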

pdxrod · 2021-08-03T03:32:15Z

That worked. Thanks.


pdxrod · 2021-08-09T13:14:54Z

Hi - I wonder if you have any recommendations from this list for my next Python/ML/AI book? https://www.amazon.com/s?k=python+machine+learning&ref=nb_sb_noss


gedeck · 2021-08-09T13:57:41Z

I would pick the first one to start with.
https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1492032646/

There is also this book that is available online and in paper:
https://www.deeplearningbook.org/
https://mitpress.mit.edu/books/deep-learning

gedeck mentioned this issue Aug 1, 2021

Ch 3. Line 77 in Python Code #23

Closed

gedeck mentioned this issue Apr 25, 2022

Python code for Chapter 3 - Web Stickness - TypeError in the original code #36

Closed
